Abstract


The  purpose  of  this  report  is  to  outline  the  major  concepts  and  developments  in  the  area  of  fault-tolerant 
computing.  Both  hardware  and  software  fault  tolerance  issues  are  addressed.  The  topics  covered  include 
module,  function  and  system-level  fault  detection  methods,  redundancy  and  reconfiguration  strategies,  valid 
fault  models,  and  coding  and  checking  in  computer  systems.  Software  fault  tolerance  methods  such  as 
recovery  blocks,  design  diversity,  and  checkpointing  and  recovery  are  also  discussed.  Major  issues  in  modeling 
and  evaluation  of  fault-tolerant  systems  are  outlined.  The  design  of  two  successful  commercial  systems  is 
discussed. 
1. Introduction


The study of fault-tolerant computing has paralleled the development of modern computers. One of the very
early contributions was due to von Neumann [1], the designer of the first stored program machine. Von
Neumann’s work addressed the question of synthesizing reliable computers from unreliable components and
developed the ideas of redundancy and replication that are common in many computers today. Strong impetus for
fault tolerance came from the space program in the early 1960’s. There was the need to build systems that would
survive without maintenance for extended periods of time. Manned space flights provided a further boost to fault
tolerance. Reliability techniques advanced rapidly during this period. In this initial period, interest in fault
tolerance largely remained the domain of the space, defense and telephone industries. With the rapid introduction of
computers into all areas of science, business and the humanities, that domain of interest has broadened
significantly. High reliability and availability have become critical for efficient functioning of our modern society.
In this regard, the development of VLSI techniques has provided a major impetus to the advancement of
fault-tolerant computing. VLSI designs have made replication and redundancy both cost effective and practically
feasible.


In the past twenty years, fault-tolerant computing has matured into a broad discipline encompassing many
aspects of computer design. This article is intended to provide the reader with an overview of the different thrust
areas which encompass hardware and software fault tolerance. Three factors drive the interest in fault tolerance:
first is the need for high reliability; second is the need for high availability. (For example, AT&T’s ESS
switching systems have an availability requirement of less than 2 minutes of down-time per year.) Third is the direct
impact of a loss in reliability on system performance (also referred to as performability).

This article is divided into six sections. Section 2 deals with the broad subject of hardware fault tolerance.
Important characteristics of hardware fault-tolerance techniques such as hardware, information and time
redundancy and the development of self-checking circuits are discussed. The question of software fault tolerance is
discussed in Section 3. This is an important area since software failures are fast becoming the dominant failure mode
in complex computer systems. Apollo and shuttle missions have been aborted due to software faults. Recently, sections
of the AT&T network were virtually paralyzed due to a software bug. Section 4 addresses the questions of testing
and design for testability. Basic to the design of fault-tolerant systems is the availability of defect-free parts.
Efficient testing strategies are critical to determine the presence of defects and faults. Section 5 addresses the
question of evaluation, an issue which is critical from both the designer and the user perspective. Methods and
tools to determine the dependability of the overall system, and to make comparative evaluations, are discussed.
Both analytical and measurement-based methods are outlined. The final section discusses the design of two
successful commercial fault-tolerant systems.


2.  Hardware  Fault-Tolerance  Techniques 

A reliable computer system needs to provide its normal level of service in the presence of hardware and
software faults [2]. There are two philosophies of achieving this reliability: (1) fault avoidance, which is any
technique to prevent the occurrence of faults in the first place; (2) fault tolerance, which is any technique to allow the
system to behave normally despite the occurrence of faults. Fault tolerance can be implemented using one of two
basic approaches: (1) fault masking, where the system masks the effect of a fault through some form of majority
voting; (2) fault detection and recovery, where a system has a method for first detecting the presence of a fault,
subsequently locating where the fault has occurred, next isolating the fault, reconfiguring in a spare, and restarting
the system [3].

This section will describe fault-tolerance techniques for hardware faults. Hardware fault tolerance can be
achieved through the use of some form of redundancy [4]: hardware redundancy (such as spare hardware); time
redundancy (such as repeating operations in time on existing hardware); information redundancy (such as some
form of coding); algorithmic redundancy (modifying the algorithm running in parallel hardware with extra steps);
and software redundancy (such as extra lines of software code). We will discuss the basic techniques used in each
of these approaches for hardware fault tolerance in both uniprocessor and multiprocessor systems.
2.1. Hardware Redundancy


There are three basic forms of hardware redundancy: passive, active and hybrid [4]. Passive hardware
redundancy relies on voting mechanisms to mask the occurrence of faults by using the concept of majority voting.
Passive approaches do not need fault detection or system reconfiguration. The most common form of passive
redundancy is called triple modular redundancy (TMR), which triplicates the hardware necessary to perform the
required operations and uses a voter to determine the output of the system. In this approach, the primary difficulty
is the voter. If the voter fails, the entire system fails. A common approach to avoid this problem is to use three
voters and provide three independent outputs. Figure 1 shows the two forms of TMR. Several stages of TMR can be
interconnected using this approach by connecting the outputs of the voters of one TMR stage to the inputs of the
modules of the next TMR stage. The voting can be performed by either a hardware voter (which can perform the
voting very fast, but requires a lot of extra hardware logic) or a software voter (which runs on existing processors
that are performing normal computations as well, and is therefore generally slow). A generalization of the TMR
approach is N-modular redundancy (NMR), which uses N copies of a module instead of 3. The NASA Space
Shuttle onboard computer system uses four computers on which a majority vote is performed.

Figure 1. Triplication and voting: (a) TMR with one voter; (b) TMR with three voters.
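
The majority voting at the heart of TMR and NMR can be illustrated with a small software voter. This is only a sketch: the names majority_vote, run_nmr, good_module and faulty_module are hypothetical and stand in for replicated hardware modules.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def run_nmr(modules, x):
    """Feed the same input to every replicated module and vote on the outputs."""
    return majority_vote([module(x) for module in modules])


def good_module(x):
    return x * x


def faulty_module(x):
    # Hypothetical faulty replica: always off by one.
    return x * x + 1


# TMR: one faulty replica out of three is masked by the voter.
result = run_nmr([good_module, good_module, faulty_module], 7)
```

The voter itself remains a single point of failure in this sketch, which is the difficulty that the three-voter arrangement of Figure 1(b) addresses.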

Active  hardware  redundancy  attempts  to  achieve  fault  tolerance  by  fault  detection,  fault  location,  and  fault 
recovery.  The  most  common  form  of  fault  detection  is  duplication  and  comparison  which  uses  two  identical 
copies  of  hardware,  having  them  perform  the  same  computations  in  parallel,  and  comparing  the  results  as  shown  in 
Figure  2.  One  of  the  commercial  products  from  Stratus  Computers  uses  a  pair-and-spare  approach  where  two 
duplexed  components  are  used  for  self-checking  and  fault  tolerance.  Two  processor  boards  are  used,  where  each 
board  contains  a  pair  of  microprocessors  used  in  duplicate  and  compare  mode  to  check  themselves. 

Another form of fault detection includes off-line fault diagnosis, which involves applying a set of test input
patterns to various components of the system and comparing the outputs to the expected outputs for each
component. Other forms of fault detection include periodically interleaving normal computations with diagnostic tests,
or using self-checking hardware as will be described later.


A second form of active redundancy is called standby sparing, where one module is operational and one or
more modules serve as standbys, or spares. Various fault detection schemes are used to determine when a module
has become faulty, and fault location is used to determine exactly which module is faulty. The reconfiguration
operation in standby sparing can be viewed conceptually as a switch whose output is selected from one of the
modules providing inputs to the switch. Standby sparing can bring a system back into full operation after the
occurrence of a fault, but it requires that a momentary disruption in performance occur while reconfiguration is
performed. If the disruption of processing must be minimized, hot standby sparing can be used, where the spares
operate synchronously with the on-line modules, and are prepared to take over at any time. Cold standby sparing
uses unpowered spares. The advantage of cold sparing is that spares do not consume power until needed to replace
a faulty module. A key advantage of standby sparing is that in a system containing n identical modules, such as a
multiprocessor, fault tolerance can be provided with k < n spare modules.

Figure 2. Duplication and comparison.


Hybrid hardware redundancy combines the attractive features of both the active and passive approaches.
Fault masking is used to prevent the system from producing erroneous results, and fault detection, location and
recovery are used to reconfigure the system in the event of a fault. The most common form of hybrid redundancy
is N-modular redundancy with spares. In this approach, a basic core of N modules is arranged in a voting
configuration. In addition, spares are provided to replace faulty units in the NMR core.
2.2. Information Redundancy


Information redundancy is the addition of redundant information to data to allow fault detection, fault
masking, and fault tolerance. Examples of information redundancy are error detecting and correcting codes (ECC). A
code’s error detection and correction properties are based on its ability to partition a set of 2^n words, each
n bits wide, into a code space of 2^k words and a noncode space of 2^n - 2^k words. Each code is constructed such
that a given number of errors transforms a code-space word into a word in the noncode space. Errors are detected
by decoding circuits that identify any word outside the code space. Error correction is performed by more
extensive decoding that uniquely associates a noncode space word with the original code word transformed by the
errors.


An example of an error detecting code is the parity code, where given an n-bit word, one attaches an extra
bit to convert it to an even or odd parity word. Any single bit error in the parity coded word will be detected by a
simple decoding circuit using a set of XOR gates.
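
As a minimal sketch of the parity code, the following Python fragment encodes a word with even parity and checks it with a chain of XORs, mirroring the XOR-gate decoder. The function names are illustrative only.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """XOR of all bits: 1 if the number of ones is odd, else 0."""
    return reduce(xor, bits, 0)


def encode_even(bits):
    """Append one check bit so that the whole word has even parity."""
    return list(bits) + [parity(bits)]


def check_even(word):
    """True if the received word still has even parity."""
    return parity(word) == 0


word = encode_even([1, 0, 1, 1])

# A single flipped bit changes the overall parity and is detected.
single_error = list(word)
single_error[2] ^= 1

# Two flipped bits restore even parity and escape detection.
double_error = list(single_error)
double_error[0] ^= 1
```

The double-error case shows the limit of the code: it has a Hamming distance of 2, so it detects any odd number of bit errors but no even number.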


Within a single word, the number of errors detectable or correctable is related to the minimum separation or
Hamming distance between the words of a code space, which is the minimum number of bit positions by which
two words from the code space differ. Codes that need to detect d errors need to have a Hamming distance of
d + 1, and codes that need to correct c errors need a Hamming distance of 2c + 1. Error detection and correction
codes vary widely in detection and correction properties, encoding and decoding complexity, and code efficiency.

The most commonly used codes are the parity check codes that are characterized by the parity check matrix,
H. For example, consider a length-6 code, n = 6, with three information bits, k = 3, and three check bits, r = 3.
The two H matrices in Figure 3 provide the same error-correcting property. Since all the columns are distinct, the
code can correct all single bit errors, but the parity check circuit for H1 is less complex than that for H2. H1 requires
three 2-input XORs to compute the parity checks, whereas H2 requires two 2-input XORs and one 3-input XOR;
hence the encoder/decoder for the H2 code will be slower and more complex.
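
The following sketch shows how a parity check matrix with distinct columns supports single-error correction. The matrix H below is an assumption chosen for illustration (each check bit is the XOR of two information bits); it is not a reproduction of either matrix in Figure 3, and the function names are hypothetical.

```python
from itertools import combinations, product

# Assumed (6, 3) parity check matrix; word layout is [d0, d1, d2, c0, c1, c2].
H = [
    [1, 1, 0, 1, 0, 0],  # c0 = d0 xor d1
    [0, 1, 1, 0, 1, 0],  # c1 = d1 xor d2
    [1, 0, 1, 0, 0, 1],  # c2 = d0 xor d2
]


def encode(data):
    d0, d1, d2 = data
    return [d0, d1, d2, d0 ^ d1, d1 ^ d2, d0 ^ d2]


def syndrome(word):
    """Multiply H by the word modulo 2; all zeros means a valid code word."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)


def correct(word):
    """Correct a single-bit error by matching the syndrome to a column of H."""
    s = syndrome(word)
    if s == (0, 0, 0):
        return list(word)
    columns = [tuple(row[j] for row in H) for j in range(len(word))]
    if s not in columns:
        raise ValueError("uncorrectable (multiple-bit) error detected")
    fixed = list(word)
    fixed[columns.index(s)] ^= 1
    return fixed


def min_distance():
    """Minimum Hamming distance over all pairs of distinct code words."""
    words = [encode(d) for d in product([0, 1], repeat=3)]
    return min(sum(a != b for a, b in zip(u, v)) for u, v in combinations(words, 2))
```

For this assumed code the minimum distance is 3, which satisfies 2c + 1 with c = 1 and matches the single-error-correcting property stated above.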


Figure 3. Parity check matrices for two simple codes.

In high-speed memories, single-bit error-correcting and double-bit error-detecting (SEC-DED) codes are
most commonly used. This is because most semiconductor RAM chips are organized for one bit of data output at
a time; therefore the failure of one chip manifests itself as a one-bit error. Parity codes are used routinely in
computers to check errors in busses, memory and registers. Cyclic redundancy checks are used to detect errors in
communication channels, tapes, and disks. M-out-of-N codes detect errors in control store memories. Arithmetic
codes detect errors in arithmetic units like adders and multipliers. For more information about coding in reliable
computer systems, the reader is referred to [6, 7].


Self-checking logic designs use the error detecting codes and some extra hardware to detect faulty logic
circuits that could be single points of failure in a system. Each self-checking circuit has coded inputs and outputs.
A circuit is classified as fault-secure if, for any specified fault within the circuit, the circuit never produces an
incorrect output code word when stimulated by a correct input code word. A self-testing circuit produces a noncode
output word for at least one code word input for each possible fault. A totally self-checking circuit has properties of both
fault-secure and self-testing circuits [8]. Self-checking circuits have been used widely in the AT&T Electronic
Switching Systems 3A processors.

2.3. Time Redundancy

The basic concept of time redundancy is the repetition of computations two or more times and comparing
the results to determine if a discrepancy exists. If an error is detected, the computations can be performed again to
see if the disagreement remains or disappears. Such approaches are good for detecting errors due to transient
faults, but cannot protect against errors resulting from permanent faults.

Another form of time redundancy to handle permanent faults modifies the way the computations are
performed the second time. One approach uses alternating logic for self-dual combinational circuits [9], which
performs a function on some set of inputs in one time instant, and performs the same function on the complemented
input in a subsequent time step, the output of which should be the complement of the original function value of
the original input. If the second value of the function is not the complement, an error is detected.
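
A small simulation can make the alternating logic check concrete. The sketch below assumes a three-input majority function (the carry of a full adder), which is self-dual, and a hypothetical fault model in which one input line is permanently stuck at 0; neither choice comes from [9].

```python
def majority3(a, b, c):
    """Self-dual function: complementing all inputs complements the output."""
    return (a & b) | (b & c) | (a & c)


def majority3_a_stuck_at_0(a, b, c):
    """Hypothetical faulty circuit: input line a is permanently stuck at 0."""
    return majority3(0, b, c)


def alternating_eval(circuit, inputs):
    """Evaluate on the inputs, then on their complements, and compare.

    Returns (first_output, ok), where ok is False if the second output is
    not the complement of the first, i.e. an error has been detected.
    """
    first = circuit(*inputs)
    second = circuit(*[1 - v for v in inputs])
    return first, second == 1 - first


fault_free = alternating_eval(majority3, (1, 1, 0))
with_fault = alternating_eval(majority3_a_stuck_at_0, (1, 1, 0))
```

In this model the stuck-at fault is flagged on the input (1, 1, 0), where it corrupts the first output, while inputs on which the fault has no effect pass the check with the correct value.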

The second approach uses recomputing with shifted operands [10], which is applicable to bit-sliced
organizations of hardware. In the first time step, the normal computation is performed on the operands and the results
stored in a register. In the next time step, the operands are shifted left by k bits, and the output is shifted right by
k bits and compared with the result of the previous computation. Any error in k-1 consecutive bit slices of an
arithmetic or logical operation will be detected by this method. The additional hardware requirement is the three
shifters, the storage register to hold the results of the first computation, and the comparator.
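
Recomputing with shifted operands can be sketched with a simulated bit-sliced ripple-carry adder. The fault model here (the sum output of one slice stuck at 0) and the names sliced_add and reso_add are assumptions made for illustration; the adder is given n + k + 1 slices so that the shifted operands and the carry-out fit.

```python
def sliced_add(a, b, width, bad_slice=None):
    """Ripple-carry addition, one bit slice at a time.

    If bad_slice is given, the sum output of that slice is stuck at 0
    (an assumed fault model).
    """
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))
        if i == bad_slice:
            s = 0
        result |= s << i
    return result


def reso_add(a, b, n, k, bad_slice=None):
    """Add two n-bit operands twice: normally, then shifted left by k bits.

    Returns (first_result, ok); ok is False when the two results disagree.
    """
    width = n + k + 1
    first = sliced_add(a, b, width, bad_slice)
    second = sliced_add(a << k, b << k, width, bad_slice) >> k
    return first, first == second


clean = reso_add(5, 6, n=4, k=2)
faulty = reso_add(5, 6, n=4, k=2, bad_slice=1)
```

Because the shift moves each result bit to a different slice in the second pass, the faulty slice corrupts different bit positions in the two computations and the comparison exposes the disagreement.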

A variant of this method is called recomputing with swapped operands, where in the first time step, the
operation is performed in the normal form. In the following time step, the upper and lower halves of the operands
are swapped such that a faulty bit slice operates on opposite halves of the operands in two computations. The
additional hardware requirements are in the form of several multiplexers, a storage register and a comparator.
2.4. Algorithmic Redundancy


A relatively new approach to fault tolerance is the use of algorithm-based fault tolerance, which is useful in
developing low-cost techniques for error detection and fault tolerance in parallel processor systems while
performing specific computation-intensive applications [11]. In contrast to conventional data encoding, which is done at the
word level in order to protect against errors which affect bits in a word, in algorithm-based approaches data is
encoded at a higher level. This encoding can be done by considering the set of input data to the algorithm and
encoding this set. The original algorithm must then be redesigned to operate on this encoded data and to produce
encoded output data. The redundancy in the encoding enables the correct data to be recovered or, at least, makes it
possible to recognize that the data are erroneous. This technique has been applied to systolic arrays performing a variety of
computations such as matrix operations, Fast Fourier Transform, matrix equation solvers, sorting, QR
factorization, recursive least squares, filtering, and singular value decomposition [12].

We illustrate the application of an algorithm-based checking technique by an example: the multiplication of
two N x N matrices. In the checksum encoding, an extra row and an extra column are appended to the original
matrix, which are the sums of the elements of the columns and rows, respectively [11]. After the matrix-matrix
multiplication is performed, the result matrix also preserves the checksum property. If there is an error in the
result matrix element (i, j), it will be identified by verifying the equality of the sum of the row elements with the
checksum for row i, and by verifying the equality of the sum of the column elements with the checksum for
column j. Once the erroneous element is identified, the correct element can be reconstructed by taking the sum of
all elements of that row (column) except the erroneous element and subtracting this sum from the row (column)
checksum. This is illustrated in Figure 4 which shows a 5 x 4 row checksum encoded matrix multiplied by a 4 x
5 column checksum encoded matrix on a 5 x 5 processor array having row and column broadcasting capability to
produce a 5 x 5 full checksum matrix.
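
The checksum scheme can be sketched in plain Python on small integer matrices. The function names are hypothetical, and integers are used so that checksum comparisons are exact; a floating-point version would need a tolerance. The sketch handles only a single erroneous data element.

```python
def with_checksum_row(a):
    """Append an extra row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_checksum_col(b):
    """Append an extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def locate_and_correct(c):
    """Find and repair one bad data element of a full checksum matrix.

    Returns None if all checksums agree, or the (i, j) position repaired.
    """
    n = len(c) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:n]) != c[i][n]]
    bad_cols = [j for j in range(n) if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Row checksum minus the other elements of the row gives the true value.
        c[i][j] = c[i][n] - sum(c[i][m] for m in range(n) if m != j)
        return (i, j)
    raise ValueError("error pattern is not a single data element")


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(with_checksum_row(A), with_checksum_col(B))

corrupted = [row[:] for row in C_full]
corrupted[0][1] = 99                      # inject an error into element (0, 1)
position = locate_and_correct(corrupted)  # repairs the matrix in place
```

The product of the two encoded operands is itself a full checksum matrix, so the same row and column sums that were appended to the inputs can be checked on the output without recomputing the multiplication.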

Recently, algorithm-based checking techniques have been applied on more general-purpose multiprocessors
such as hypercubes [13]. Studies on actual measurements of various algorithms on a hypercube have revealed that
it is possible to get very high error coverages (90-95%) for detection at relatively low cost (10-15% time
overhead).
2.5. Software Redundancy

In applications that use programmable computers, many fault-detection techniques can be implemented in
software as several extra lines of code to verify the consistency of a result, such as to check the magnitude of a
signal. A consistency check uses a priori knowledge about the characteristics of information to verify the
correctness of information. In Randell [14], such checks are application specific.


Capability checks are often performed to verify that a system possesses the capability expected. For
example, a check can verify whether a processor has the privilege of reading or writing to a set of regions in memory
in the presence of an addressing fault. Another form of capability check is used to verify whether a processor
can execute a specific instruction on specific data.


3.  Software  Fault  Tolerance 

Software  plays  a  crucial  role  in  a  computer  system’s  ability  to  tolerate  design,  manufacturing,  and  wear-out 
faults.  Faults  in  software  are  typically  due  to  problems  in  design  or  implementation,  while  faults  in  hardware  can 
be due to design, manufacturing, wear-out, or environmental upsets. This section presents an overview of the
ways  in  which  software  design  and  implementation  techniques  can  be  used  to  detect  and  tolerate  both  software 
design  errors  and  hardware  faults. 

The development of highly reliable software necessitates more than just software fault-tolerance techniques.
The development process must include rigorous application of fault avoidance approaches, which include the
correct use of formal specification languages, structured programming, formal proof of correctness, and extensive
testing at all levels of implementation. Design for fault avoidance is a necessary prerequisite for effective software
fault tolerance [2]. Software fault tolerance addresses the issues of detecting and recovering from residual design
and implementation errors in the software and detecting and recovering from wear-out and
environmentally-induced hardware faults.

3.1.  Detection  and  Recovery  from  Software  Faults 

The fundamental approach to detecting software design errors is through exploiting diversity. Diversity in
implementation and design can be in the form of acceptance tests, executable assertions, alternative software
modules, or full diversity through designing and implementing multiple versions of the complete software by
different teams of software engineers. Diversity can be captured through encoding knowledge of the expected
behavior at various levels of the software and then comparing what is expected against what is observed. This
encoding of knowledge can be at the level of the process outputs, intermediate results, system behavior, or
expected algorithm behavior. The two primary approaches to software fault tolerance that provide a complete
framework for capturing diversity in both design and implementation, as well as providing formal mechanisms for
error detection, error containment, and recovery, are: (1) recovery blocks [14]; and (2) N-version software [15].


3.1.1.  Recovery  Blocks 

Recovery  blocks,  as  developed  by  Randell  [14],  implement  diversity  in  the  form  of  acceptance  tests  and 
alternative  software  modules.  Software  is  partitioned  hierarchically  into  self-contained  modules  called  "recovery 
blocks."  Each  recovery  block  validates  its  own  operation  and  either  returns  correct  results  or  notifies  the  system 
of  an  error.  As  illustrated  in  Figure  5  [16],  each  recovery  block  is  composed  of  an  acceptance  test,  the  primary 
alternative  software  module,  and  the  secondary  software  modules.  The  acceptance  test  is  used  to  determine  the 
correctness  of  a  software  module’s  results  (error  detection)  and  the  alternative  modules  provide  recovery  from  a 
detected  error.  Diversity  can  be  captured  in  both  the  acceptance  test  and  the  secondary  alternative  software 
modules. 


Figure 5. The recovery block approach to software fault tolerance.
An example of the acceptance test and alternative modules employed in recovery blocks can be seen in the
following sorting algorithm described by Randell [14]:

ensure sorted(S) ∧ (sum(S) = sum(prior S))
by quickersort(S)
else by quicksort(S)
else by bubblesort(S)
else error

The acceptance test should verify that none of the elements have changed and that the elements are, indeed, sorted.
The primary algorithm can be the most efficient preferred algorithm, while the alternatives may be less efficient
and are invoked only if the primary module results in an error.
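
The recovery block for sorting might be rendered in Python roughly as follows. This is a sketch rather than Randell's notation: recovery_block, accept_sorted and buggy_quickersort are hypothetical names, and the faulty primary is contrived to show the fallback.

```python
import copy


def recovery_block(prior, acceptance_test, alternates):
    """Try each alternate in turn on a fresh copy of the prior state."""
    for alternate in alternates:
        candidate = copy.deepcopy(prior)  # restore the saved state for each attempt
        try:
            result = alternate(candidate)
        except Exception:
            continue  # an exception counts as a failed alternate
        if acceptance_test(prior, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def accept_sorted(prior, s):
    """ensure sorted(S) and sum(S) = sum(prior S)"""
    return all(x <= y for x, y in zip(s, s[1:])) and sum(s) == sum(prior)


def buggy_quickersort(s):
    # Hypothetical faulty primary: sorts but drops the largest element.
    return sorted(s)[:-1]


def bubblesort(s):
    s = list(s)
    for end in range(len(s) - 1, 0, -1):
        for i in range(end):
            if s[i] > s[i + 1]:
                s[i], s[i + 1] = s[i + 1], s[i]
    return s


result = recovery_block([3, 1, 2], accept_sorted, [buggy_quickersort, bubblesort])
```

As in the original example, the sum comparison is only a partial check that no element has changed; a stricter acceptance test could compare the multisets of elements at higher cost.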

Note  should  be  made  that  the  recovery  block  approach  can  also  be  implemented  with  distributed  or  parallel 
architectures  in  which  the  alternatives  are  initiated  in  parallel  with  the  primary  module.  Also,  full  design  diversity 
can  be  implemented  if  a  formal  specification  forms  the  basis  of  each  recovery  block  and  diverse  programming 
teams  develop  the  alternative  modules  and  diverse  acceptance  tests. 

3.1.2.  N  Version  Programming 

The N version programming approach to fault tolerant software has been extensively described by Avizienis
[15]. N version programming differs from recovery blocks in that it employs design diversity at the software
system level through designing and implementing multiple (N) versions of the software with different teams of
programmers. Instead of employing an acceptance test, N version programming utilizes voters to reach a consensus
of two or more outputs among the N member versions. This approach necessarily must rely on diversity in the
design to detect programming errors in the N versions of the software. If diversity is not enforced in the design
and implementation, there may be an undetected or unrecoverable failure due to a single cause. Both the recovery
block and N version programming approaches require a reliable execution environment for voting or executing
assertions and for time-efficient execution of the software modules.
3.1.3. Error Detection Techniques


Although recovery blocks and N version programming are the best known approaches that provide complete
frameworks for software fault tolerance for programming errors, there are also a wide variety of individual
techniques that are commonly employed outside of these frameworks. Examples of these techniques include
acceptance tests and executable assertions that are commonly used to detect anomalies due to either programming errors
or hardware failures [18-22]. Fail-stop tests, such as timers for detecting time-out conditions, are also common.

3.2.  Software  Approaches  to  Detection  and  Recovery  from  Hardware  Faults 

3.2.1.  Masking  and  Voting 

The N version programming approach is directly applicable to detection and recovery of hardware faults
when the multiple versions are executed on different hardware units. This is a variant of the classic NMR (TMR)
form of fault tolerance as described earlier in this article. It is possible that hardware faults can be tolerated even
without diversity if the hardware and software are replicated N times and the voter is designed to be fault tolerant.
However, the application of design diversity to both the N software and N hardware units can provide a line of
defense against software programming faults, hardware design faults, environmental upsets, and wear-out faults
[3].

3.2.2.  Assertions  and  Alternative  Execution 

The recovery block approach is also directly applicable to detection and recovery from hardware failures, as
well as from programming faults. As described elsewhere in this article, recent algorithm-specific techniques for
encoding inputs and checking expected outputs (algorithm and behavior-based fault tolerance) have been
developed for detection and recovery from hardware faults. These algorithm-specific approaches are a
combination of hardware and software fault tolerance in that they employ algorithm modification for detection and
recovery from hardware failures.
3.2.2.1. Fault-Tolerant Data Structures


Linked data structures provide a specific example of specialized techniques for error detection and recovery.
The initial work concerning detection of errors in links (structural integrity) utilizing redundant links was
developed by Taylor, Morgan and Black. Error detection and correction algorithms for data structures, when used
concurrently with normal data structure accesses, typically degrade performance. If data structure checking
operations are performed in a small locality around the currently accessed node, then error detection and correction can
potentially be performed concurrently with data structure accesses without severely degrading the system
performance. In addition, an arbitrary number of errors in the data structure may be detected and corrected,
assuming not too many errors exist within a given locality.


One example of such a technique for detecting and correcting structural errors in data structures is the virtual
backpointer [24]. The virtual backpointer provides the capabilities of structural error detection and correction
as well as the generation of backpointer values used in backward traversals. The Virtual Double-Linked List
(VDLL) is a uniform data structure that employs the virtual backpointer for local error detection and correction
and for backward traversals. The VDLL requires the same storage space as the standard double-linked list (DLL),
and retains the simplicity of the DLL, since it is possible to move directly from a node to its parent, using the
virtual backpointer. In addition, the VDLL has enhanced error detection and correction capabilities over the DLL.
An example of the VDLL is shown in Figure 6. In addition to the normal forward pointer, a virtual backpointer is
stored in each node. The virtual backpointer is a function of the address of the previous (back) node and the
current forward pointer. It can be shown that it is possible to detect any two errors in a VDLL and correct any
single error for forward moves.
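The idea can be illustrated with a minimal sketch. The text only states that the virtual backpointer is a function of the previous node's address and the current forward pointer; the use of XOR as that function, the integer addresses, and the names `build_vdll`, `back`, and `forward_ok` are assumptions made here for illustration and are not taken from [24].

```python
NULL = 0  # assumed sentinel address meaning "no node"

def build_vdll(addrs):
    """Build a linear list over the given node addresses.
    Each node stores a forward pointer and a virtual backpointer,
    assumed here to be (previous address) XOR (forward pointer)."""
    nodes = {}
    for k, addr in enumerate(addrs):
        prev = addrs[k - 1] if k > 0 else NULL
        fwd = addrs[k + 1] if k + 1 < len(addrs) else NULL
        nodes[addr] = {"fwd": fwd, "vbp": prev ^ fwd}
    return nodes

def back(nodes, addr):
    """Recover the previous node's address from the two stored fields."""
    node = nodes[addr]
    return node["vbp"] ^ node["fwd"]

def forward_ok(nodes, addr):
    """Local check before a forward move: the successor's recovered
    backpointer must name the current node."""
    nxt = nodes[addr]["fwd"]
    if nxt == NULL:
        return True
    return nxt in nodes and back(nodes, nxt) == addr
```

Under this assumption no explicit backpointer is stored, yet a backward move is a single XOR, and a damaged forward pointer is exposed by a check that touches only the current node and its successor.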


3.2.3. Reexecution Through Checkpointing and Rollback

Checkpointing is an important technique for recovery after error detection by means of rollback reexecution
of a process. Checkpointing schemes can be broadly classified as full or incremental checkpointing. The former
saves the entire active state space of a process while the latter saves the difference between the current and a
previous state space. A checkpointing scheme can be implemented at the system or application level.
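The difference between the two classes can be sketched as follows. This is only an illustration: the process state is modeled as a Python dictionary, and the function names `full_checkpoint`, `incremental_checkpoint`, and `restore` are hypothetical, not part of any system cited in the text.

```python
def full_checkpoint(state):
    """Full checkpoint: copy the entire state."""
    return dict(state)

def incremental_checkpoint(state, previous):
    """Incremental checkpoint: record only what differs from `previous`."""
    changed = {k: v for k, v in state.items()
               if k not in previous or previous[k] != v}
    deleted = [k for k in previous if k not in state]
    return {"changed": changed, "deleted": deleted}

def restore(base, increments):
    """Rebuild a state from a full checkpoint plus later increments, in order."""
    state = dict(base)
    for inc in increments:
        state.update(inc["changed"])
        for k in inc["deleted"]:
            state.pop(k, None)
    return state
```

The sketch shows the usual trade-off: increments are smaller to save, but recovery must replay every increment taken since the last full checkpoint.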




Figure  6.  Example  VDLL  robust  data  structure. 

Research on classic checkpointing and rollback recovery has been extensive [16-20]. Graph-theoretic
methods by which the programmer can decide where to insert checkpoints have been developed. The program is
decomposed by the programmer into a sequence of tasks between which the checkpoints can be inserted. It is
assumed that the execution time, the checkpoint time, and the recovery time of each of these tasks are known in
advance. With this information, the algorithms can determine the optimal places to insert checkpoints so that the
maximum checkpoint time, the expected checkpoint time, or the expected run time is minimized.


3.2.3.1. Compiler-Assisted Checkpointing

Compiler-assisted techniques for implementing a checkpointing scheme have recently been developed [25].
This approach can achieve, in some instances, both programmer transparency and reduced checkpoint size without
modification of the hardware or operating system. Compiler-generated sparse potential checkpoint code is used to
maintain the desired checkpoint interval. The compiler-assisted approach to checkpointing utilizes several
measurement and adaptive learning techniques to exploit periodic reduction in memory requirements to reduce the size
of checkpoints, when possible [25].


3.2.3.2. Error Detection and Recovery in Distributed Systems

Fault  tolerance  in  distributed  systems  has  long  been  the  focus  of  extended  research  [26-29].  Practical  appli¬ 
cations  of  this  research  include  distributed  databases  and  real-time  systems.  Important  elements  in  fault-tolerant 
distributed  systems  include  reliable  communication  and  synchronization  protocols,  reliable  storage  media  and 
storage algorithms (replication), and reliable individual processing nodes [16].

One  of  the  important  concepts  in  distributed  systems  is  that  of  atomic  actions  and  commit  protocols.  They 
are  used  to  ensure  the  completion  or  rollback  of  transactions.  Nested  transactions  have  been  proposed  as  a 
mechanism for encapsulating the synchronization and failure properties of distributed systems [26]. Recovery in
distributed database systems is often implemented through rollback of transactions and use of shadow paging or
undoing a write-ahead log. Network partitioning and data replication have also been used to tolerate node failures
[29]. Examples of protocols developed to deal with data partitioning and replication include weighted voting,
majority consensus, and quorum-based commit [26].
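As an illustration of weighted voting, the sketch below checks the two quorum conditions usually associated with that scheme: a read quorum and a write quorum must overlap, and two write quorums must overlap. The text names the protocol but does not state these conditions, so they, together with the names `quorums_valid` and `has_quorum` and the vote assignment in the usage note, are assumptions made for illustration.

```python
def quorums_valid(votes, read_quorum, write_quorum):
    """Check the overlap conditions for a vote assignment.
    votes: mapping from replica name to its number of votes."""
    total = sum(votes.values())
    reads_see_latest_write = read_quorum + write_quorum > total
    writes_are_serialized = 2 * write_quorum > total
    return reads_see_latest_write and writes_are_serialized

def has_quorum(votes, reachable, needed):
    """True if the reachable replicas together hold at least `needed` votes."""
    return sum(votes[name] for name in reachable) >= needed
```

For example, with votes `{"a": 2, "b": 1, "c": 1}`, a read quorum of 2 and a write quorum of 3 satisfy both conditions, so replica `a` alone can serve reads after a partition, while writes need `a` plus one other replica.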

3.2.3.3. Recovery through Checkpointing and Rollback in Shared Memory Parallel Multiprocessors

Since  different  processors  in  a  shared  memory  multiprocessor  system  can  access  the  same  memory  space,  a 
rollback  of  one  process  in  multiprocessor  systems  may  require  a  rollback  of  another,  as  well.  It  has  recently  been 
shown  that  through  appropriate  modification  of  cache  coherence  protocols,  periodic  checkpointing  of  the  cache 
contents can be made into the shared memory in such a way that a consistent shared memory state is maintained
[30]. The consistent shared memory state ensures that only the process encountering the error, resulting from a
processor transient fault, is involved in the rollback recovery at the point of error detection and no rollback
propagation is required. Without rollback propagation, rapid rollback recovery is thus achieved simply by invalidating
the cache contents and then restarting the process from the checkpoint after reloading the program counter and
registers.




In  the  multiprocessor  cache-based  checkpointing  approach  there  are  two  instances  in  which  a  process  has  to 
be  checkpointed.  The  first  instance  occurs  whenever  a  cache  block  modified  since  the  last  checkpoint  is  to  be 
written  back  to  the  shared  memory,  which  happens  when  a  cache  block  is  replaced  on  a  cache  miss.  The  second 
instance  occurs  when  another  processor  is  to  read  a  dirty  block  modified  in  a  processor’s  cache  since  its  last 
checkpoint.  Checkpointing  is  initiated  by  the  cache  controller  in  hardware  and  is  transparent  to  system  or  applica¬ 
tion  software.  Checkpointing  a  process  includes  flushing  the  cache  blocks  modified  since  the  last  checkpointing 
session  and  saving  the  processor  internal  registers. 

Once a processor error is detected, all cache blocks in the private cache of that processor, except those that are
unwritable, are invalidated. The processor internal registers are reloaded and execution is restarted. Cache
misses  occurring  when  a  processor  resumes  execution  are  serviced  by  data  from  the  global  checkpoint  which  is 
stored  in  the  shared  memory  and  caches  of  other  processors.  The  cache  coherence  protocol  enforces  delivering  the 
correct  version  of  data  if  another  cache  has  a  block  which  matches  the  miss. 

To  integrate  the  multiprocessor  cache-based  checkpointing  scheme  into  cache  coherence  protocols,  one  extra 
state for a cache block is introduced. A modified cache block is split into two classes: writable-modified and
unwritable.  Figure  7  illustrates  the  Illinois  cache  coherence  protocol  [30],  which  has  been  modified  by  adding  one 
state  to  incorporate  the  cache-based  checkpointing  scheme. 

There  have  been  numerous  recent  developments  in  implementing  shared  memory  programming  environ¬ 
ments  on  distributed  memory  multiprocessors.  Typically  called  distributed  shared  memory,  such  environments 
utilize  memory  coherence  protocols  to  implement  the  shared  memory  paradigm  in  software.  Memory  coherence- 
based  checkpointing  techniques  have  been  developed  for  distributed  shared  memory,  similar  to  those  for  cache- 
based checkpointing [31]. An individual processor takes a checkpoint if a page of memory that it has modified
since the last checkpoint is requested by another processor. Rollback is implemented by simply invalidating
all local pages and restoring processor registers.

The advantage of most checkpointing schemes that are embedded in the memory management protocol is
that they are transparent to the application programmer. They potentially can be implemented so as to minimize
performance degradation. However, their disadvantage is that it is not easy to change or control the frequency of
checkpointing with such approaches. In general, an integrated approach applying a hierarchy of checkpointing
strategies (memory management, operating system, and application level) is necessary to effectively address the
shortcomings of individual techniques.

Figure 7. Cache coherence protocol incorporating cache-based checkpointing.


4.  Testing 

Testing is the process of discovering faults, defects or malfunctions in the system under test. The test
process consists of exercising the system with a set of test vectors and analyzing its response for correctness. Design
of a reliable computer system involves many levels of abstraction. Typical levels of abstraction from the lowest to
the highest are: logic level, register level, instruction set level, processor level and system level. Testing closely
follows these levels of abstraction and separate testing procedures apply to each level. The stimuli and responses
defining a test experiment use the information being processed relevant to each level. For example, testing at the
logic level involves binary vectors or vector sequences. Similarly, higher levels of testing involve machine
instructions, arithmetic numbers, textual data, messages, procedures and application programs. Thus, testing is a
very broad term encompassing many different activities and environments. Testing theory and practice are most
mature at the logic level. Testing at the logic level is well-defined and rigorous. Testing becomes less precise with
increasing levels of abstraction. At the system level, testing is largely ad hoc and based on intuition and
experience.


Testing begins at the semiconductor chip manufacturing level and continues at higher levels of assembly and
packages, such as printed circuit boards, the hardware system and finally the complete system including the
operating system and application software. The initial purpose of testing at chip level is for diagnosis of chip
defects. The diagnosis process identifies what is specifically wrong with a bad chip. The information is valuable
in fine-tuning the semiconductor fabrication process. Defects at this level consist of open or shorted wires and
transistors, slow transistors, too high a power consumption, weak drivers and so forth. Once the process is
fine-tuned and mature, the main purpose of chip level testing becomes that of sorting good chips from bad chips. This
is where the most rigorous testing takes place. The cost of testing a chip is a significant fraction of the overall
cost of manufacturing. However, the cost of testing the same chip at higher levels is even higher. Experience
shows that a defective chip, escaping as a good chip due to an imperfect testing procedure, costs 10 times as much
to test on a printed circuit board. The cost includes increased difficulty in testing, and locating and replacing the
bad chip on the board. If the defective chip escapes detection at the board testing, it costs 100 times as much to
locate and fix at the processor testing level. The cost of a defective chip increases tenfold at every level.
Therefore, the goal of chip testing is a near perfect differentiation between a good chip and a defective chip [32].


Testing continues after a system becomes operational and deployed in the field. The purpose of testing in
the field can be of a preventive nature or can be for repair of an unoperational system. Testing for preventive
maintenance locates defective chips or boards which have not been exposed in the normal operation of the
system. Such unexposed defects are called latent failures. Eventually they will be exposed by some system
operation. In highly reliable systems, latent failures reduce the fault tolerance capabilities of the system if such
failures are allowed to accumulate. Therefore, highly reliable systems require periodic testing to flush out latent
failures. Testing for diagnosis and repair consists of narrowing down the location of a fault to a field-replaceable
unit (FRU). An FRU for a computing system is typically a board or a multi-chip module. Locating the fault to
the chip boundary is very difficult and expensive and therefore repair of the board is postponed until the board can
be returned to a repair facility. The testing discipline can be broadly classified into these fields: Fault Modeling,
Fault Simulation, Automatic Test Pattern Generation (ATPG), Design and Synthesis of Testable Circuits, and
Built-In Self-Test (BIST).


4.1.  Fault  Modeling 


For testing purposes, all physical faults are abstracted/modeled to a level appropriate for the component
under test. At the chip level the most widely accepted fault model is that of line stuck-at-0 or stuck-at-1. In
practice, the stuck-at fault model is further restricted to the single stuck-at fault model, meaning the test procedure
assumes that there is, at most, one fault in the circuit. This assumption reduces the test generation complexity.
Physical defects such as an open connection, or a short to ground or power can be modeled as constant logic 0 or
1 in many situations. Also, experience shows that even when some physical defects do not behave as stuck-at
faults, the test patterns that detect all single stuck-at faults also detect a large percentage of such physical defects.
Another advantage of the stuck-at fault model is that it is a logic model and therefore many of the results from
Boolean algebra can be used in the test generation and analysis of such faults. For this reason, stuck-at fault is a
widely accepted fault model for most digital circuits. The stuck-at fault model is useful in proving structural
integrity of a circuit within the constraint of Boolean equivalence. There are also other logic fault models that
have become very important for highly reliable components. A short between two wires is called a bridging fault,
which is logically equivalent to a wired AND or OR depending on technology and relative electrical strengths of
two opposing polarity signals on the short. Other physical faults, such as resistive shorts and opens, trapped
charges in gate oxide of MOS transistors and weak transistors, can adversely affect the original timing
specification of a circuit. Such faults are modeled as delay faults.

The next higher level of abstraction for fault models is broadly called functional faults. A functional fault is
simply an incorrect execution of a function. For specific functions these incorrect behaviors can be narrowed
down. For example, a restricted fault model for an address decoder might say, "for input address i the faulty
decoder incorrectly selects address j." A slightly more general model might say, "the decoder selects nothing, or
selects j, or selects i and j," and a very general functional fault model might say, "the decoder selects any subset
(including null) of all addresses." Similarly, functional fault models have been derived for many other functions
such as adders, multipliers, arithmetic and logic units of a processor, PLAs, micro-sequencers, instruction set
processors and memories. We give two more examples, one for an n-bit adder and the second for an instruction set
processor.

A restricted functional fault model for an n-bit adder is that the sum differs by $\pm 2^{i}$ for some bit position i.
Such a model can be derived from the assumption that, at most, one internal carry or an output sum is faulty. If we
extend the physical faults to a full adder in a ripple-carry adder implementation, then it can be shown that the
functional fault model for the n-bit adder is that the sum differs by $(\pm 2^{i} \pm 2^{i+1})$ in the presence of a fault.
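The restricted model for a single faulty internal carry can be checked with a small bit-level simulation. The sketch below is an illustration only: the function name `ripple_add`, its `flip_carry_into` parameter, and the choice to model the fault as an inverted carry are assumptions, and the result includes the final carry-out as bit n.

```python
def ripple_add(a, b, n, flip_carry_into=None):
    """Add two n-bit numbers with a ripple-carry adder.
    If flip_carry_into = i is given, the carry entering bit i is inverted
    (an assumed way of modeling one faulty internal carry)."""
    carry = 0
    total = 0
    for i in range(n):
        if flip_carry_into == i:
            carry ^= 1
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i                # sum bit of the full adder
        carry = (ai & bi) | (carry & (ai ^ bi))        # carry-out of the full adder
    return total | (carry << n)                        # carry-out becomes bit n
```

Because the stages above the faulty carry still add correctly, inverting the carry into bit i shifts the result by exactly one unit of weight $2^{i}$, up or down.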

For  a  processor,  a  widely  accepted  functional  fault  model  is  that  when  instruction  I  has  to  be  executed,  the 
processor:  (a)  does  not  execute  any  instruction;  or,  (b)  it  executes  some  other  instruction  J;  or,  (c)  it  executes  I 
and some other instruction J. Such abstraction allows one to generate a test set for the processor without the
tural  gate  level  information. 

The  higher  the  level  of  abstraction  of  the  fault  model,  the  more  generally  applicable  it  becomes,  independent 
of  specific  implementations.  At  any  level  of  abstraction  the  fault  model  can  be  very  general  or  very  restrictive. 
Very  general  fault  models  give  a  higher  degree  of  confidence  in  the  quality  of  the  test  set,  in  the  sense  that  the  test 
set  will  cover  a  large  number  of  physical  failures  and  a  broad  class  of  failures  (e.g.,  stuck-at,  bridging  and  delay). 
Restrictive functional fault models correspondingly give a lower quality of test sets. More information can be
found in [33, 34].

4.2.  Fault  Simulation 

Fault  simulation  consists  of  simulating  a  circuit  in  the  presence  of  faults.  The  most  common  fault  model 
used  by  fault  simulators  is  the  single  stuck-at  fault.  However,  in  theory  any  fault  model  can  be  used  during  simu¬ 
lation.  By  comparing  the  outputs  from  simulation  of  fault-free  circuits  with  the  faulty  circuit  one  can  determine  if 
the  fault  is  detected  by  the  applied  test.  For  a  given  test  set,  T,  a  fault  simulation  produces  the  list  of  faults  which 
are detected by T. The number of detected faults expressed as a percent of all faults is called fault coverage.
Fault coverage is a measure of the quality of a test set. The process of finding the fault coverage of T is called
fault grading the test set T. A perfect test set will have 100% fault coverage for the assumed fault model. A
component passing a 100% test set may still have other faults not covered by the assumed fault model. There are
several applications of fault simulation. Fault simulation is used: 1) in fault grading a given test set; 2) in
diagnosis of a faulty circuit; 3) in automatic test generation; and 4) in verification of error detection/correction
circuits in highly reliable systems.
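Fault grading can be sketched on a toy circuit. The example below simulates one fault at a time and compares each faulty response with the fault-free response. The circuit y = (a AND b) OR c, the line names, and the function names `simulate` and `fault_grade` are assumptions chosen for illustration.

```python
LINES = ["a", "b", "c", "n1", "y"]   # n1 is the output of the AND gate

def simulate(vector, fault=None):
    """Evaluate y = (a AND b) OR c for one input vector (a, b, c).
    fault is None or a pair (line, stuck_value) for a single stuck-at fault."""
    def val(line, v):
        return fault[1] if fault and fault[0] == line else v
    a = val("a", vector[0])
    b = val("b", vector[1])
    c = val("c", vector[2])
    n1 = val("n1", a & b)
    return val("y", n1 | c)

def fault_grade(tests):
    """Return the detected single stuck-at faults and the fault coverage in percent."""
    faults = [(line, v) for line in LINES for v in (0, 1)]
    detected = [f for f in faults
                if any(simulate(t, f) != simulate(t) for t in tests)]
    return detected, 100.0 * len(detected) / len(faults)
```

With the single vector (1, 1, 0) only four of the ten faults are detected; adding (0, 1, 0), (1, 0, 0) and (0, 0, 1) raises the coverage to 100% for this circuit under the single stuck-at model.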

Fault  grading  a  test  set  is  the  most  common  use  of  fault  simulators.  A  test  set  derived  using  a  higher  fault 
model  may  be  graded  for  lower  fault  models.  For  example,  circuit  designers  use  functional  verification  tests  to 
check  for  design  errors.  The  same  tests  are  often  used  to  test  the  circuits  for  stuck-at  faults.  Therefore,  the  fault 
simulation  is  used  to  evaluate  the  effectiveness  of  a  functional  test  as  a  stuck-at  test.  If  the  fault  coverage  is  not 
satisfactory, the designer can add more functional vectors to improve the fault coverage. Thus indirectly, the fault
simulator is used to generate test vectors for a circuit.

In the diagnosis applications, a fault simulator is used to generate fault dictionaries. A fault dictionary is a
list of faults detected for each test vector. Additionally, a fault dictionary may also store the actual output
response for each fault or a compressed version of the response, called a signature of the fault. The diagnosis
process (identification and location of the fault) relies on matching the response (or signature) from the circuit under
test to the simulated response (or signature) stored in the dictionary.
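A fault dictionary and the matching step can be sketched as follows. The toy circuit y = (a AND b) OR c, the test set, and the names `simulate`, `build_dictionary`, and `diagnose` are assumptions for illustration; the dictionary here stores the full output response rather than a compressed signature.

```python
TESTS = [(1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]   # assumed test set

def simulate(vector, fault=None):
    """Evaluate y = (a AND b) OR c with an optional single stuck-at fault."""
    def val(line, v):
        return fault[1] if fault and fault[0] == line else v
    a = val("a", vector[0])
    b = val("b", vector[1])
    c = val("c", vector[2])
    n1 = val("n1", a & b)
    return val("y", n1 | c)

def build_dictionary(tests):
    """Map each single stuck-at fault to its simulated response over the test set."""
    faults = [(line, v) for line in ("a", "b", "c", "n1", "y") for v in (0, 1)]
    return {f: tuple(simulate(t, f) for t in tests) for f in faults}

def diagnose(dictionary, observed):
    """Return every fault whose stored response matches the observed one."""
    return sorted(f for f, resp in dictionary.items() if resp == tuple(observed))
```

An observed response of (1, 1, 0, 1) matches only a stuck-at-1, whereas (0, 0, 0, 1) matches three faults around the AND gate that no test can tell apart, so diagnosis resolves to a set of candidates rather than always to one fault.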

Fault injection experiments are an important aspect in the design and verification of highly reliable circuits.
Hardware experiments are slow and expensive and very limited in the types of faults that can be injected. Fault
simulators  on  the  other  hand  are  easy  to  use  and  are  flexible  in  terms  of  fault  type,  location  and  method  of  injec¬ 
tion.  Fault  simulators  can  also  evaluate  the  effects  of  several  thousand  faults  in  a  single  pass. 

The  simplest  method  of  simulating  faults  is  the  serial  fault  simulation.  It  consists  of  taking  the  fault-free 
circuit and transforming it into a faulty circuit by injecting one fault and then simulating the circuit with a
standard logic simulator. The main advantage of this method is that no special fault simulator code is needed. In
addition, it can simulate just about any type of fault. However, the serial method is very time-consuming,
considering the fact that a 10,000 gate circuit can have close to 50,000 single stuck-at faults. This time can be reduced
by appropriately simulating many faults simultaneously. Three well-known methods are: 1) Parallel Fault
Simulation; 2) Concurrent Fault Simulation; and 3) Deductive Fault Simulation.


Parallel  fault  simulation  exploits  the  word  parallel  operations  of  a  computer  by  using  each  bit  in  the  word  to 
represent  a  different  fault.  Thus,  one  can  simulate  32  to  256  faults  in  a  single  pass  depending  on  the  machine 
used. Another efficiency added to most practical parallel fault simulators is the use of event-driven simulation
techniques. Experience shows that a fault causes only a few logic values to change from fault-free values;
therefore, an event-driven fault simulator will need to execute very few events (gate evaluations). A concurrent fault
simulator is also an event-driven fault simulator. It keeps all faulty machine states but only simulates differences
between a fault-free and faulty machine. Deductive fault simulation is a symbolic simulation method and it
deduces faulty behavior of all faulty machines in one pass (subject to available memory). The operations used in
the deductive simulator are the union and intersection of symbolic fault lists. The execution speed of the above
three methods depends to a large degree on the programming techniques and as a result they are hard to compare purely
based on methodology. Parallel is the easiest to implement of the three. Deductive is potentially the fastest for
stuck faults in synchronous sequential circuits, but implementation complexities may make it slower than parallel.
The concurrent is the most general and flexible in terms of extending it to include detailed circuit timing, any type
of fault behavior, and higher level functional models. For further reading on fault simulation see [33, 35, 36].
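The bit-per-fault idea behind parallel fault simulation can be sketched with Python integers standing in for machine words. Bit 0 carries the fault-free machine and each remaining bit carries one faulty machine. The toy circuit y = (a AND b) OR c and the names `inject` and `parallel_simulate` are assumptions for illustration; the sketch is not event-driven.

```python
LINES = ["a", "b", "c", "n1", "y"]                 # n1 is the AND gate output
FAULTS = [(line, v) for line in LINES for v in (0, 1)]
WIDTH = len(FAULTS) + 1                            # bit 0 is the fault-free machine
ALL = (1 << WIDTH) - 1

def inject(line, word):
    """Force the stuck value in the bit positions owned by faults on `line`."""
    for k, (l, v) in enumerate(FAULTS, start=1):
        if l == line:
            word = word | (1 << k) if v else word & ~(1 << k)
    return word

def parallel_simulate(vector):
    """Simulate all machines at once; return the faults detected by `vector`."""
    a = inject("a", ALL if vector[0] else 0)
    b = inject("b", ALL if vector[1] else 0)
    c = inject("c", ALL if vector[2] else 0)
    n1 = inject("n1", a & b)                       # one AND evaluates every machine
    y = inject("y", n1 | c)
    good = ALL if y & 1 else 0                     # replicate the fault-free output
    diff = (y ^ good) & ALL
    return [FAULTS[k - 1] for k in range(1, WIDTH) if (diff >> k) & 1]
```

Each gate is evaluated once per vector with a single bitwise operation, regardless of how many faulty machines share the word.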

4.3.  Test  Generation 

The simplest method to test a circuit is to subject it to random test patterns. In fact, it is quite an acceptable
method for many circuits. One can use the fault simulator to calculate the fault coverage, add more random
vectors if the coverage is not sufficient, and iterate the process until the desired coverage is reached. However,
experience shows that many circuits require an inordinate number of random patterns to achieve a high fault
coverage. The cost of fault simulating a large number of patterns could be far more than using a non-random
algorithmic method of test generation to achieve the same coverage.

There are several such deterministic test generation methods. A test vector for a stuck-at fault in combinational
logic implementing a Boolean function F can be derived by taking the Boolean function F' of the faulty circuit and
forming F XOR F'. Any vector that produces F XOR F' = 1 is a test vector for that fault. In practice, this procedure
of taking symbolic Boolean functions of faulty and fault-free circuits and forming their exclusive-or is quite
time-consuming, and finding a vector that makes the F XOR F' function equal to 1 is a known hard problem. Efficient
test generators are based on one of the two known algorithms: 1) Roth's D-algorithm and 2) Goel's PODEM algorithm.
Both of these methods use a 5-valued algebra (0, 1, D, $\overline{D}$, and X). D is a symbol representing a logic 1
in a fault-free and logic 0 in a faulty circuit. Similarly, symbol $\overline{D}$ represents logic 0 in a fault-free
and logic 1 in a faulty circuit. X is an unknown value. In both methods, the objective is to justify logic values on
various lines in the circuit to accomplish a) Fault Excitation and b) Fault Propagation. Fault excitation is the
process of applying a logic value opposite to the stuck-at fault value. Fault propagation is the process of applying
inputs such that a fault effect (i.e., signal D or $\overline{D}$) is propagated to an output of the circuit.

The D-algorithm assigns appropriate logic values locally to a fault site and then makes assignments
forward or backward in the circuit to justify the assigned value. These assignments are further justified by more
assignments to other lines. This process is iterated until all internal assignments are justified solely by primary
input assignments. During the justification process of an assigned logic value on a line, conflicts of signal values
may arise on some other lines, in which case the assignment must be undone. This is a systematic trial and error
procedure, and it will find a test vector if one exists. In PODEM, assignments are made only to primary inputs.
PODEM is a branch-and-bound search method, in which the inputs are assigned one at a time and the effect of
each assignment is propagated before another primary input is assigned. If the effect of an assignment causes a
bounding condition, then the assignment is backtracked and the input is assigned a different value.

In both PODEM and the D-algorithm, the procedures will find a test vector if one exists or will determine that the
fault cannot be detected. In the worst case, both procedures must try all binary combinations of the inputs. The
worst case rarely happens in real circuits; however, the procedure can sometimes make a large number of backtracks.
A large number of backtracks occurs mostly for faults which are not detectable. Undetectable faults are associated
with redundant gates or lines in a circuit. For high reliability it is important to remove any unintentional
redundancy in the circuit. There are no other algorithms which detect redundancy more efficiently than a test
generation algorithm. As a result, test generation algorithms are also used in multi-level circuit minimization
procedures to remove unnecessary gates.

Test generation methods for synchronous sequential circuits are extensions of combinational test generation
algorithms. The extension is based on transforming a sequential circuit into an iterative combinational logic array (see
Figure 8). One cell of the array is called a time frame. In this transformation each flip-flop is modeled as a
combinational element with input equal to the current state and output equal to the next state of the flip-flop. The iterative
array is simply a very deep combinational circuit. The number of cells (time frames) in the array corresponds to
the number of vectors in the test sequence. It can be shown that any detectable fault can be detected in $4^{n}$ vectors
in a sequential circuit with n flip-flops. Therefore, the worst case bound on the number of time frames needed is
$4^{n}$, which is generally too large for any practical circuit. Therefore, during the test generation process the number
of time frames is dynamically expanded, as needed. One additional complexity for sequential circuit test
generation is that of initialization of flip-flops. If a reset sequence is given then it simplifies the problem somewhat.
However, one has to be careful that the reset sequence can become invalid in the presence of a fault. If the reset
sequence is not given, the test generator must find such a sequence. Sequential test generators can be helped by
high-level information such as state transition diagrams. In most cases, though, the state transition diagrams are
either too large or not given; therefore the applicability of such a test generator is limited.

Figure 8. Time frame expansion of sequential circuits. x(t), y(t), s(t) are signal values at time t.
C(0), C(1), C(2) are all identical copies of the combinational portion of the sequential circuit.
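Time frame expansion with a dynamically growing number of frames can be sketched on a one-flip-flop circuit. Everything specific here is an assumption for illustration: the circuit (output y = s AND x, next state s XOR x), the fault (next-state line stuck-at-0), a reset state of 0 that remains valid under the fault, and the names `cell`, `run`, and `shortest_test`. The search is brute force rather than an actual sequential test generator.

```python
from itertools import product

def cell(x, s, faulty=False):
    """One time frame: return (output y, next state)."""
    y = s & x
    nxt = 0 if faulty else x ^ s      # assumed fault: next-state line stuck-at-0
    return y, nxt

def run(seq, faulty=False, reset_state=0):
    """Chain one copy of the cell per input vector, passing the state along."""
    s, outputs = reset_state, []
    for x in seq:
        y, s = cell(x, s, faulty)
        outputs.append(y)
    return outputs

def shortest_test(max_frames):
    """Expand one frame at a time until some input sequence exposes the fault."""
    for k in range(1, max_frames + 1):
        for seq in product((0, 1), repeat=k):
            if run(seq) != run(seq, faulty=True):
                return seq
    return None
```

No single vector detects this fault, because the wrong next state is not visible at the output until the following frame; the search therefore expands to two frames and returns the sequence (1, 1).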

In addition to combinational-based test generators, there are simulation-based test generators, which use a
trial-and-error approach. First, a set of trial vectors is applied and fault simulated. Based on the fault
simulation results, the "best" vector is retained and added to the test sequence. The process is repeated until the
desired fault coverage is reached. The "best" vector is defined by some cost criterion, such as the number of new
faults detected or the number of flip-flops set. The selection of the trial set is somewhat random but can be
constrained to meet certain timing requirements. For example, we can restrict successive vectors to differ by at
most one bit to prevent races in asynchronous circuits. An important advantage of this method is that it is more
general than combinational-based test generators. Any circuit and any fault type that the simulator is able to
handle is acceptable for this test generation method. One disadvantage is that, in some circuits, the number of
trial vectors that need to be simulated to achieve very high fault coverage can be very large, and the resulting
test sequence also tends to be very long compared to the deterministic approach described above.
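
The trial-and-error loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the three-input circuit, its net names, the single stuck-at fault list, and the function names are all hypothetical, and the cost criterion used is "number of new faults detected".

```python
import random

# Hypothetical circuit: n1 = a AND b, n2 = NOT c, out = n1 OR n2.
NETS = ["a", "b", "c", "n1", "n2", "out"]
ALL_FAULTS = {(net, value) for net in NETS for value in (0, 1)}  # single stuck-at faults


def simulate(vector, fault=None):
    """Return the primary output; `fault` is (net, stuck_value) or None."""
    values = {}

    def drive(net, value):
        # A stuck-at fault overrides whatever value is driven onto its net.
        values[net] = fault[1] if fault and fault[0] == net else value

    for net, bit in zip("abc", vector):
        drive(net, bit)
    drive("n1", values["a"] & values["b"])
    drive("n2", 1 - values["c"])
    drive("out", values["n1"] | values["n2"])
    return values["out"]


def detected_by(vector, faults):
    """Fault simulation: which of `faults` change the output for `vector`?"""
    good = simulate(vector)
    return {f for f in faults if simulate(vector, f) != good}


def generate_tests(target_coverage=1.0, trials_per_round=4, max_rounds=200, seed=0):
    rng = random.Random(seed)
    remaining = set(ALL_FAULTS)
    tests = []
    for _ in range(max_rounds):
        if 1 - len(remaining) / len(ALL_FAULTS) >= target_coverage:
            break
        trials = [tuple(rng.randint(0, 1) for _ in range(3))
                  for _ in range(trials_per_round)]
        # Cost criterion: number of not-yet-detected faults the vector detects.
        best = max(trials, key=lambda v: len(detected_by(v, remaining)))
        newly_detected = detected_by(best, remaining)
        if newly_detected:
            tests.append(best)
            remaining -= newly_detected
    coverage = 1 - len(remaining) / len(ALL_FAULTS)
    return tests, coverage
```

Calling `generate_tests()` returns the retained vectors and the coverage reached. Because the trial vectors are random, the number of rounds needed varies with the seed, which reflects the disadvantage noted above.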

Finally, there are test techniques specific to certain functions, such as adders, multipliers, iterative logic
arrays, random access memories (RAM), associative memories and microprocessors. These techniques are termed
functional testing. The term functional testing comes from the fact that each function-specific test procedure
assumes a very precise functional fault model. For example, for adders and multipliers which are made of 3-bit
full adders, the fault model assumed is that the fault affects the truth table of a full adder in any way. If one
assumes that at most one full adder is faulty, one can derive tests analytically which are far more compact than
is possible with automatic test generators. Ripple carry adders can be tested with 8 test vectors; the number 8 is
constant, independent of the length of the adder. Such regular structures with a constant number of tests
independent of the number of cells are called C-testable. Ripple carry adders, two-dimensional combinational
multipliers and many other iterative logic structures have proven to be C-testable.
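
One way to see why 8 vectors suffice for a ripple carry adder of any length is sketched below. The construction and all function names are illustrative assumptions, not the derivation used in the cited literature: six of the eight full-adder input combinations reproduce their carry-in as carry-out and can be repeated in every cell, while the remaining two flip the carry and are applied in alternating cells. For brevity, the check covers only faults that change a single truth-table row of one cell.

```python
from itertools import product


def full_adder(a, b, cin):
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))  # (sum, carry-out)


def ripple_add(a_bits, b_bits, cin, faulty_cell=None, faulty_table=None):
    """Add LSB-first bit lists; optionally replace one cell's truth table."""
    sums, carry = [], cin
    for i, (a, b) in enumerate(zip(a_bits, b_bits)):
        if i == faulty_cell:
            s, carry = faulty_table[(a, b, carry)]
        else:
            s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums, carry


def c_tests(n):
    """Eight tests that apply all 8 input combinations to every cell."""
    tests = []
    for a, b, cin in product((0, 1), repeat=3):
        # Carry-out equals carry-in, so the same (a, b) works in every cell.
        if full_adder(a, b, cin)[1] == cin:
            tests.append(([a] * n, [b] * n, cin))
    for start in (1, 0):
        # (1,1,0) produces carry 1 and (0,0,1) produces carry 0, so they alternate.
        bits = [(start + i) % 2 for i in range(n)]
        tests.append((bits, list(bits), 1 - start))
    return tests


def detects_all_single_row_faults(n):
    tests = c_tests(n)
    good = {row: full_adder(*row) for row in product((0, 1), repeat=3)}
    for cell in range(n):
        for row, good_out in good.items():
            for bad_out in product((0, 1), repeat=2):
                if bad_out == good_out:
                    continue
                table = dict(good)
                table[row] = bad_out
                if not any(ripple_add(a, b, c, cell, table) != ripple_add(a, b, c)
                           for a, b, c in tests):
                    return False
    return True
```

`len(c_tests(n))` is 8 for any `n`. A wrong sum bit is observed directly, and a wrong carry flips the sum of the next cell (or the final carry-out), which is why each applied row is also observable.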

The difficulty in memory testing is not how to generate a test set, but what realistic fault models to use and how
to get a short test for such faults. The complexity of test length is very important in memories because of the
number of bits involved in present RAMs. For example, if the test length grows as the square of the number of bits
in the memory, then a 1-Megabit RAM will require on the order of $10^{12}$ test vectors. The most commonly used
functional fault models for RAM are: bits stuck-at 0 or 1; faults in the address decoder resulting in failure to
address a bit, addressing the wrong bit, or addressing more than one bit; coupling faults between two bits
resulting in an unwanted read-write operation on a coupled bit; pattern-sensitive faults resulting in failure of a
read or write of a bit in the presence of a specific bit pattern in the neighboring bits; and so on. In memory
testing, the single fault assumption is not used. Furthermore, no upper bound on the number of faults is assumed.
Efficient test algorithms have been derived for all of the above functional fault models. Resources for more
information on test generation methods are [33-38].

4.4. Design for Testability

In spite of major advances in test generation and fault simulation techniques, testing of digital systems still
remains a very difficult problem. Testing cost remains a significant fraction of the overall cost of manufacturing
VLSI chips. The complexity of test generation and the cost of testing can be reduced by the process of design for
testability (DFT). Two important factors in a testable circuit are controllability and observability of individual
nodes in the circuit. Controllability is the ability to establish a specific signal value at an internal node in a
circuit by setting values on (directly accessible) inputs. Observability is the ability to determine the signal
value at any node by setting values on inputs and observing subsequent outputs.


Most DFT techniques either resynthesize an existing design or add extra hardware to the design. Resynthesis
systems remove most redundancies in combinational circuits. In sequential circuits, the resynthesis system encodes
the states to make them easier to reset, control and observe. All DFT methods affect the original design in terms
of chip area, I/O pins and speed. The goal of a DFT method is to achieve the desired testability with minimal
overhead. The cost benefit of DFT is hard to quantify in real money, since the benefits are spread over many
factors, such as reduction in test generation time and enhanced quality (fault coverage), and hence a reduction in
the return rate of bad parts. DFT can also affect test length, test application time, tester memory, diagnosis and
field maintenance time. Because of the lack of precise quantitative cost-benefit analysis, manufacturers,
designers, test engineers and users disagree a great deal in their assessment of the cost benefits of DFT.

Many testability techniques are ad hoc: for example, adding reset lines, partitioning large circuits, removing
redundancies, inserting control points and observation points (test points), converting asynchronous to
synchronous logic, breaking long feedback paths, breaking long counters and shift registers into smaller parts, and
so on. Figure 9 shows test point insertion. Addition of test points can result in too many I/O pins on a chip. The
pin overhead can be reduced by employing a scan register to control and observe the test points. A scan register is
simply a shift register with parallel load capability. A typical scan register has 4 I/O pins: shift data-in, shift
data-out, load, and shift clock. Test data is shifted in from outside and applied to the control points. The
responses are captured from observation points into the scan register with a load signal and then shifted out for
later analysis. The scan register trades off test point I/O pins against increased area and increased test time.
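
The shift-in, load, shift-out sequence can be modeled behaviorally as follows. This is a minimal sketch: the class and function names are hypothetical, and the circuit between control points and observation points is represented by an arbitrary Python function.

```python
class ScanRegister:
    """Shift register with parallel load (behavioral model, names are illustrative)."""

    def __init__(self, length):
        self.bits = [0] * length

    def shift(self, data_in):
        """One shift-clock pulse; returns the bit that leaves on shift data-out."""
        data_out = self.bits[-1]
        self.bits = [data_in] + self.bits[:-1]
        return data_out

    def load(self, values):
        """Load pulse: capture observation-point values in parallel."""
        assert len(values) == len(self.bits)
        self.bits = list(values)


def run_scan_test(pattern, logic):
    """Apply `pattern` to the control points and return the captured response."""
    reg = ScanRegister(len(pattern))
    for bit in reversed(pattern):          # n shift clocks: pattern[i] ends in stage i
        reg.shift(bit)
    reg.load(logic(list(reg.bits)))        # 1 load pulse captures the responses
    shifted = [reg.shift(0) for _ in pattern]   # n more shift clocks to read out
    return shifted[::-1]                   # reorder so that stage 0 comes first


def example_logic(c):
    # Hypothetical logic between four control points and four observation points.
    return [c[0] & c[1], c[1] ^ c[2], 1 - c[3], c[0] | c[3]]
```

For a register of length n, one test costs 2n shift clocks plus one load, which illustrates the increased test time mentioned above.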

There are several methods to select test points in a circuit. Some methods are ad hoc, some are based on analysis,
test generation and fault simulation, and some are based on converting sequential circuits into combinational
logic. Good candidates for ad hoc selection of test points are: lines with high fanouts, global feedback paths,
(intentional) redundant lines, flip-flops, addresses, data and control signals of embedded memories, and internal
clocks. Analysis-based methods use quantifiable controllability and observability measures, or other measures such
as sequential depth and the number of feedback paths passing through a node. These are used in putting test points
at the least controllable and observable nodes. A global optimization program will put test points such that the
overall gain in controllability and observability is maximum. Since all such measures are approximate, the effect
of test points on test generation time and fault coverage cannot be accurately predicted. All of the methods rely
on a test generator and fault simulation to find which nodes should be the best candidates for test points. One can
use a limited trial-and-error method to insert the test points, and then generate a test to accurately see its
effect. One can combine the above two methods by first selecting a few test points, based purely on structural
analysis, and then augmenting them with more test points selected by a test generator.

Figure 9. Test point insertion (the example shows control of value 1 and observation).

When test points are inserted at all flip-flops and only at flip-flops in conjunction with a scan register, the
methods are generically called scan designs. If the flip-flops are master-slave or edge-triggered, then additional
flip-flops for the scan register are not needed. The scan register is implemented directly on the original
flip-flops in the design. Of course, one still needs additional pins: shift-in, shift-out, shift-clock and load.
The load signal may be combined with the system clock, thus saving a pin. A scan design makes all flip-flops
completely controllable and observable. This means that the circuit can be placed in any state and its next state
completely observed. Since the inputs to the combinational logic come either from primary inputs or from some
flip-flops, the scan allows one to apply any test vector to the combinational logic portion of the circuit. And
since the outputs from the combinational logic go to primary outputs or to some flip-flops, the responses are
completely observable. This method essentially reduces the sequential test problem to a combinational test
problem. There are several slightly different scan-based structures used by different computer manufacturers. The
most widely known structure is Level Sensitive Scan Design (LSSD), used by IBM in many of its systems. LSSD uses
level-sensitive (not edge-triggered), hazard-free latches. Two latches form a master-slave flip-flop. It is
estimated that LSSD scan designs have 10% to 15% area overhead.

Scan has also become a standard for board-level testing. When scan is applied to the periphery of a chip it is
termed boundary scan. The goal is to make every chip on a board completely controllable and observable from the
edge connector of the board. In addition, the boundary scan registers are designed in such a way that they can also
be used to test the interconnect between chips. At the board level, the predominant failures are in the
interconnects and pins. Therefore, boundary scan is very useful in board testing. Of course, the chips themselves
must be designed for testability. Boundary scan only provides access to the chips, not testability on the chips.
Boundary scan is useful only if all chip manufacturers follow a standard method of communication on the board. IEEE
has put out a standard for boundary scan that many manufacturers have agreed to follow. Further details on a
variety of design for testability techniques can be found in [33, 39].

4.5.  Built-In  Self-Test  (BIST) 

Built-in self-test is the capability of a circuit (chip, board or system) to test itself. Most circuits are tested
by external testers, which apply the test vectors and monitor the responses. In BIST circuits, the test vectors are
internally generated and applied, and the responses are also internally monitored. A general organization for BIST
is shown in Figure 10. The test generator and response monitor have to be very simple to keep the overhead of BIST
very small. As a result, the test generator is generally a counter, which produces exhaustive test patterns, or a
linear feedback shift register (LFSR), which produces pseudo-random patterns. The response monitor is similarly
very simple. The monitor compresses the responses into a single word, called a signature. Compression methods
include counting the number of 1's and 0's in the response stream, counting 0-to-1 or 1-to-0 transitions, forming
a checksum, taking parities and so on. The circuits that produce the signature are also called signature
analyzers. The good signature is calculated by simulation of the fault-free machine and stored on the chip. During
the actual test of the circuit, the signature is produced by the compressor on chip and then compared with the
stored, good signature. A mismatch between signatures indicates a faulty chip. At times, a faulty chip produces the
same signature as the good signature and the fault goes undetected. This can happen for two reasons: 1) the test
set failed to detect the fault; or 2) the test set detected the fault but the compressor "lost" the information.
The loss of information is always possible in any compressor. For example, a parity compressor will produce the
same parity if the faulty responses have an even number of bits in error. This situation is referred to as error
masking, and the faulty output which produces the same signature as the good output is called an alias of the good
output. Aliasing probability can be analytically estimated if one can accurately characterize all error responses
of a faulty circuit. The actual loss of fault coverage due to aliasing is not so easy to estimate. A fault
simulation of the BIST circuit with the signature analyzer in place can accurately give the loss of coverage.
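
The parity example can be made concrete with a small sketch. The majority circuit, the two injected faults and the function names below are assumptions chosen for illustration; a 3-bit counter supplies the exhaustive patterns.

```python
from itertools import product


def majority(a, b, c, fault=None):
    """Hypothetical circuit under test with two injectable faults."""
    if fault == "a_stuck_at_0":
        a = 0
    ab = 0 if fault == "ab_stuck_at_0" else (a & b)
    return ab | (b & c) | (a & c)


def responses(fault=None):
    # A 3-bit counter acts as the exhaustive test generator (000 .. 111).
    return [majority(a, b, c, fault) for a, b, c in product((0, 1), repeat=3)]


def parity_signature(stream):
    signature = 0
    for bit in stream:
        signature ^= bit
    return signature


def ones_count_signature(stream):
    return sum(stream)


good = responses()
one_error = responses("ab_stuck_at_0")   # differs from `good` in one response bit
two_errors = responses("a_stuck_at_0")   # differs from `good` in two response bits
```

Here `parity_signature(one_error)` differs from the good signature, while `parity_signature(two_errors)` equals it even though the exhaustive test set exposes the fault in the raw responses. The ones-count signature separates both faulty streams from the good one in this particular example, but it has aliases of its own.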

Figure 10. An organization for built-in test, showing primary inputs and primary outputs. The stimulus generator
is usually a counter or an autonomous LFSR. The response monitor is usually a signature analyzer based on an LFSR.


Of all the compression methods, the checksum methods have proven to be the most effective in practice. A checksum
is simply an addition of numbers modulo a constant. A modulo addition can also be viewed as a division by a
constant in which the quotient is discarded and the remainder is retained. Since in the test context the numbers
themselves have no specific meaning (i.e., they are simply some binary strings), one can use any number system and
any constants, as long as the compressor has low hardware complexity and low aliasing. Linear feedback shift
registers (LFSRs) can form the checksums in polynomial algebras over the binary strings, and they are easy to
design. Aliasing probabilities of LFSRs have been analyzed for specific error models, and all analyses lead to the
conclusion that the aliasing probability is $2^{-n}$ for an n-bit LFSR. Experimental studies of some circuits using
fault simulators have shown that the loss in fault coverage due to aliasing is usually much less than 1%. More
details on pseudo-random techniques for BIST can be found in [40].
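
The view of an LFSR signature as a remainder of polynomial division can be checked numerically. The sketch below is an illustration under stated assumptions: a 4-bit register with characteristic polynomial x^4 + x + 1, a serial response stream, and an error model in which every nonzero error stream of length m is equally likely. Because the compression is linear, a faulty stream aliases exactly when its error stream has signature zero.

```python
def signature(bits, poly=0b10011, n=4):
    """Remainder of the bit stream (first bit = highest power) modulo `poly` over GF(2)."""
    reg = 0
    for bit in bits:
        reg = (reg << 1) | bit
        if (reg >> n) & 1:      # a degree-n term appeared: subtract (XOR) the polynomial
            reg ^= poly
    return reg


def to_bits(value, m):
    """Integer to a list of m bits, most significant bit first."""
    return [(value >> k) & 1 for k in reversed(range(m))]


def aliasing_fraction(m, poly=0b10011, n=4):
    """Fraction of nonzero length-m error streams whose signature is zero."""
    aliases = sum(1 for error in range(1, 2 ** m)
                  if signature(to_bits(error, m), poly, n) == 0)
    return aliases / (2 ** m - 1)
```

For m = 10 the fraction is 63/1023, about 0.0616, close to 2^-4 = 0.0625. Under this error model the exact value is (2^(m-n) - 1)/(2^m - 1), which approaches 2^-n as the stream gets longer.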


5.  Evaluation 

The foregoing sections have outlined a wide range of techniques useful for designing fault-tolerant systems. In any
given situation, the relative efficiency of these techniques must be evaluated so that design trade-offs can be
made. Such analysis is an integral part of the design process. The next two sections introduce the question of
evaluation and discuss the different methods available to model, analyze and measure the dependability of
fault-tolerant systems. In Section 5.1, methods to develop analytical models for computer system reliability,
availability and performability are outlined. A wide range of automated tools that allow an informed user to
conduct evaluations of complex structures have been developed. The characteristics of some of these tools are
given. Measurement-based methods for evaluation are discussed in Section 5.2.


5.1.  Analytical  Models 

In this section we briefly review computer system dependability modeling issues. We first discuss two widely used
combinatorial models [41]. Then we address Markov models, including availability, reliability, and performability
(reward) models. Finally, we take a look at six representative software modeling tools.

5.1.1. Simple Models for Fault-Tolerant Systems

If T is a random variable that denotes the lifetime or time-to-failure of a system component (and t its particular
value), then T has a cumulative distribution function (CDF) given by

$$F(t) = P(T \le t) \tag{1}$$

The reliability, R(t), of the component is the probability that the component survives until time t:

$$R(t) = P(T > t) = 1 - F(t) \tag{2}$$

Typically, the lifetime is assumed to be exponentially distributed, thus $R(t) = \exp(-\lambda t)$, where
$\lambda$ is the failure rate. As explained in Section 2, the elementary reliability models of fault-tolerant
computing systems are often variations on the so-called NMR (N-Modular Redundant) model. The system is composed of
n identical and independent components, m or more of which must be functioning for the system to be operational.
Thus, the system has (n - m) "hot standby" components. Under these simplifying assumptions, we can express the
reliability of the system as:

$$R_{NMR}(t) = \sum_{i=m}^{n} \binom{n}{i} R(t)^{i} \, [1 - R(t)]^{n-i}$$

Special cases include the serial system (m = n), the parallel system (m = 1), and the triple modular redundant
voting system (n = 3, m = 2).
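
The m-out-of-n expression is easy to evaluate numerically. The following sketch assumes identical, independent components with exponentially distributed lifetimes and a perfect voter or switch; the function names are illustrative.

```python
from math import comb, exp


def component_reliability(failure_rate, t):
    """R(t) = exp(-lambda * t) for an exponentially distributed lifetime."""
    return exp(-failure_rate * t)


def nmr_reliability(n, m, r):
    """Probability that at least m of n independent components, each with reliability r, work."""
    return sum(comb(n, i) * r ** i * (1 - r) ** (n - i) for i in range(m, n + 1))


def tmr_reliability(r):
    """Triple modular redundancy (n = 3, m = 2) in closed form: 3r^2 - 2r^3."""
    return 3 * r ** 2 - 2 * r ** 3
```

For a component reliability of 0.9, the triple modular redundant system evaluates to 0.972. For r below 0.5 the same formula gives a value lower than r, so redundancy of this kind only helps while the individual components are still reasonably reliable.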


The second elementary reliability model represents an N-modular Standby Redundant (NSR) system. It has (n - 1) of
n identical components maintained in a powered-off (cold standby) state. Upon failure of the single active
component, one of the (n - 1) powered-off components is switched into operation. It is assumed that there is no
chance of a failure associated with switching. The system lifetime random variable in this case is the sum of n
identical component lifetime random variables, so

$$R_{NSR}(t) = 1 - \int_0^t dF^{(n)}(s)$$

where $dF^{(n)}$ denotes n-fold convolution. If the probability of failure during switching is taken into account,
the above expression must be appropriately modified [43].
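
As an illustrative special case, under the added assumptions of exponentially distributed component lifetimes with rate $\lambda$ and perfect switching, the sum of n lifetimes is Erlang distributed and the standby reliability has a closed form:

$$R_{NSR}(t) = e^{-\lambda t} \sum_{k=0}^{n-1} \frac{(\lambda t)^k}{k!}$$

For n = 2 this gives $R_{NSR}(t) = e^{-\lambda t}(1 + \lambda t)$, which exceeds the single-component reliability $e^{-\lambda t}$ for every t > 0. It also exceeds the reliability of two components in parallel (hot standby), $2e^{-\lambda t} - e^{-2\lambda t}$, because a powered-off spare cannot fail before it is switched in.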

5.1.2. Markov Models


Markov models allow us to describe complex interactions among components and are widely used. A discrete-state
Markov process is a collection of states together with the transition rates among these states. When a Markov
process is used to model the dependability of a computer system, each state in the model represents a distinct
combination of operational and failed states of individual components or modules of the system. The process of
failing and recovering of the components is described by the transition from one state to another in the Markov
model [42]. In a discrete-time Markov process transitions can only occur at fixed intervals, while in a
continuous-time Markov model transitions can occur at any time. A Markov process has the property that the future
state of the process depends only on its present state and not on the past (i.e., it is memoryless). A
continuous-time system is said to have the Markov property if:

$$P\{X(t+s) = j \mid X(s) = i,\ X(u) = k,\ 0 \le u < s\} = P\{X(t+s) = j \mid X(s) = i\} \tag{1}$$

where s, t > 0 and i, j, k denote the states of the model.

If, in addition,

$$P\{X(t+s) = j \mid X(s) = i\} = P\{X(t) = j \mid X(0) = i\} = P_{ij}(t), \tag{2}$$

the Markov process is said to be stationary or homogeneous. In other words, a continuous-time, homogeneous Markov
model represents a time-evolving process that changes states according to the following rules:

(1) The holding time in each state i is exponentially distributed with mean $h_i$.

(2) Given that the system is in state i, it goes to state j with a probability (transition probability) $p_{ij}$.

If the exponential distribution condition (1) above is not satisfied (i.e., the distribution is of a general form),
the model is said to be semi-Markov. Usually, analytical models assume that the holding time in each state is
exponentially distributed. From a practical point of view, this assumption can limit the accuracy of the model
results.

The details of the theory and applications of Markov models to reliable systems are given in [43]. Figure 11 shows
a Markov model for a simple system with two components. The system is assumed to fail if both components fail (a
1-out-of-2 system). The failures of the two components are assumed to be independent. There are four states in the
model: the normal state $S_n$, in which both components are operational; the single-component failure states $S_i$
(i = 1, 2), in which component i has failed; and the system failure state $S_f$, in which both components have
failed. The parameters $\lambda_i$ and $\mu_i$ denote the mean failure rate and recovery rate for component i,
respectively.
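
A numerical sketch of such a four-state model is given below. The transition structure is an assumption, since the figure itself is not reproduced here: each component is taken to fail and recover independently, so from the system failure state either component can be repaired. The state indexing (0 = normal, 1 and 2 = one component failed, 3 = both failed) and all function names are illustrative. State probabilities are obtained by integrating dP/dt = P Q with a fixed-step Runge-Kutta scheme.

```python
from math import exp


def generator(lam1, lam2, mu1, mu2):
    """Transition-rate matrix Q of the assumed 1-out-of-2 model."""
    rates = {
        (0, 1): lam1, (0, 2): lam2,   # a component fails from the normal state
        (1, 0): mu1, (1, 3): lam2,    # component 1 recovers, or component 2 also fails
        (2, 0): mu2, (2, 3): lam1,    # component 2 recovers, or component 1 also fails
        (3, 2): mu1, (3, 1): mu2,     # one of the two failed components recovers
    }
    q = [[0.0] * 4 for _ in range(4)]
    for (i, j), rate in rates.items():
        q[i][j] = rate
        q[i][i] -= rate               # each row of Q sums to zero
    return q


def vec_mat(p, q):
    return [sum(p[i] * q[i][j] for i in range(len(p))) for j in range(len(p))]


def transient(q, p0, t, steps=2000):
    """State probability vector at time t (classical 4th-order Runge-Kutta)."""
    h = t / steps
    p = list(p0)
    for _ in range(steps):
        k1 = vec_mat(p, q)
        k2 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k1)], q)
        k3 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k2)], q)
        k4 = vec_mat([x + h * k for x, k in zip(p, k3)], q)
        p = [x + h / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


def component_unavailability(lam, mu, t):
    """Closed form for one repairable component that starts in the working state."""
    return lam / (lam + mu) * (1 - exp(-(lam + mu) * t))
```

Under the independence assumption, the probability of the system failure state at time t equals the product of the two component unavailabilities, which gives a convenient cross-check of the numerical solution.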

5.1.2.1. Availability Evaluation

Given that there are n states (1, 2, ..., n) in a Markov (or semi-Markov) model, then, at any time t > 0, the state
distribution can be expressed by the probability vector

$$P(t) = (p_1(t), p_2(t), \ldots, p_n(t)) \tag{3}$$

where $p_i(t)$ is the probability that the process is in state i at time t and satisfies the following condition:

$$\sum_{i=1}^{n} p_i(t) = 1. \tag{4}$$

Further, assume that the components in the system can be in one of two states: operational or failed. The system
is considered operational, or available, if at least a minimum set of components is operational. The failure and
repair process of these components can be represented by a Markov (or semi-Markov) model with state space $\Psi$
[44, 45]. We can partition the total set of states ($\Psi$) into an operational set ($\Psi_o$) and a failed set
($\Psi_f$). The probability, A(t), that the system is operational at a time t is referred to as the instantaneous
availability:

$$A(t) = \sum_{i \in \Psi_o} p_i(t) \tag{5}$$

Figure 11. Markov Model for a 1-out-of-2 System.

The interval availability, $\bar{A}(t)$, is the proportion of a given interval of time during which the system is
operational. This is given by:

$$\bar{A}(t) = \frac{1}{t} \int_0^t A(x)\,dx \tag{6}$$

The steady-state availability, A, is the limit of the interval availability as t goes to infinity:

$$A = \lim_{t \to \infty} \bar{A}(t) \tag{7}$$

Equation (7) above is equivalent to the following commonly used definition of availability:

$$A = \frac{MTTF}{MTTF + MTTR} \tag{8}$$

5.1.2.2.  Reliability  Evaluation 

To evaluate reliability based on a Markov model, we make all failure states ($\Psi_f$) absorbing states, so that
once the system enters $\Psi_f$, it is destined to stay there. That is, we modify the model by setting all
transition probabilities out of a state in $\Psi_f$ to zero. For example, $S_f$ in Figure 11 is an absorbing state.
If this modified model is solved, the system reliability can be evaluated as:

$$R(t) = \sum_{i \in \Psi_o} p_i(t) \tag{9}$$

where $\Psi_o$ is the operational set of states in the model. It is seen that R(t) is a special case of the
instantaneous availability when failure states are set to absorbing states.

5.1.2.3.  Performability  (Reward  Rate)  Evaluation 

In evaluating availability and reliability, we assume that a system is fully operational (in an up state) without
any degradation. If a system is allowed to operate in a degraded mode, a combined measure of performance and
availability, called performability [46], is often used. Typically, performability can be evaluated via reward
models by defining a reward rate $r_i$ [47] ($0 \le r_i \le 1$) for state i, rather than simply a zero or a one, as
in the case of availability and reliability models. Performability measures are generalizations of availability
measures; they fall into three basic classes, namely the instantaneous reward rate at t,

$$Y(t) = \sum_{i \in \Psi} r_i\, p_i(t), \tag{10}$$

the interval reward rate over t,

$$\bar{Y}(t) = \frac{1}{t} \int_0^t Y(x)\,dx, \tag{11}$$

and the steady-state reward rate,

$$Y = \lim_{t \to \infty} \bar{Y}(t). \tag{12}$$
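
A steady-state reward computation for a two-component system can be sketched as follows. The example rests on added assumptions: the two components fail and recover independently, so the steady-state probabilities take a product form, and the reward values (full reward with both components up, half reward with one) are arbitrary illustrative choices. State and function names are hypothetical.

```python
def steady_state(lam1, lam2, mu1, mu2):
    """Steady-state probabilities, assuming independent failure and recovery."""
    a1 = mu1 / (lam1 + mu1)    # steady-state availability of component 1
    a2 = mu2 / (lam2 + mu2)    # steady-state availability of component 2
    return {
        "Sn": a1 * a2,                # both components operational
        "S1": (1 - a1) * a2,          # component 1 failed
        "S2": a1 * (1 - a2),          # component 2 failed
        "Sf": (1 - a1) * (1 - a2),    # both failed
    }


def steady_state_reward(probs, rewards):
    """Y = sum over states of r_i * p_i."""
    return sum(rewards[state] * p for state, p in probs.items())


AVAILABILITY_REWARDS = {"Sn": 1.0, "S1": 1.0, "S2": 1.0, "Sf": 0.0}
DEGRADED_REWARDS = {"Sn": 1.0, "S1": 0.5, "S2": 0.5, "Sf": 0.0}
```

With zero-one rewards the result reduces to the steady-state availability. With the degraded rewards, and with failure rates of 1 and recovery rates of 9 for both components, the steady-state reward rate is 0.90 against an availability of 0.99, which shows how performability accounts for time spent in degraded states.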


5.1.3.  Modeling  Tools 

Various  software  tools  have  been  created  to  evaluate  dependability  for  computer  systems,  using  both  ana¬ 
lytic  and  simulation  techniques.  These  software  tools  are  sophisticated  and  require  a  user  with  a  good  degree  of 
expertise  in  reliability  engineering  and  computer  design.  A  summary  of  characteristics  of  six  representative  tools 
is  listed  in  Table  1  [41, 48].  All  of  these  tools  can  be  used  to  evaluate  dependability  measures  for  both  repairable 
and  nonrepairable  systems.  Most  are  based  on  Markov  models. 


5.2. Measurement-Based Analysis

Measurement is an essential part of the evaluation process. In the final analysis, the evaluation techniques
discussed above must be supported by measurements in the field. A study of production systems is valuable not only
for accurate evaluation but also for gaining insight into reliability bottlenecks in system design. Measurements
are made through the different stages of design, development and manufacturing and provide the basis for gaining
understanding and insight into the system and the manufacturing process.

Table 1. Summary of Characteristics of Six Dependability Evaluation Tools

Tools: HARP [49]; METASAN [50]; SAVE [51]; SHARP [52]; SURE [53]; SURF [54]
Models: Nonhomogeneous Markov; Stochastic activity networks; Fault trees; Continuous-state Markov; Directed graphs;
  Fault trees; Semi-Markov; Semi-Markov; Markov
Solution techniques: Runge-Kutta; Simulation; Gaussian elimination; Iterative method; Simulation; SOR*;
  Randomization; Simulation; SOR*; Laplace; Computation of bounds; Laplace
Input: Any failure distr.; Fault tree; Markov chain; Description of stochastic activity network; Exp. distr.;
  Fault tree; Markov chain; Multistage exp. distr.; Multiple levels of models; Markov chain; Transition matrix
Output measures: Availability; Reliability; Performance
Operating systems: UNIX, VMS, MS-DOS; UNIX; VM, MVS; UNIX, VMS; VMS; IBM TSO

* SOR: Successive overrelaxation


In some instances, measurements are made directly on production systems in the field (i.e., uncontrolled
phenomena). In other instances, experiments are devised in the laboratory under controlled conditions. Faults are
deliberately introduced, and their impact on system hardware and software is measured. Both approaches have their
relative advantages and disadvantages and are used by manufacturers and researchers as a basis for design and
evaluation. The lessons learned are useful for developing improved validation techniques and also for developing
fault masking and recovery methods to lessen the impact of defects on the user.

More than a dozen years of research effort have measured, analyzed, and modeled over 80 machine-years of data.
Issues ranging from the monitoring of computer reliability to the analysis of the measured data to quantify system
dependability (reliability and availability) in the field have been addressed. Laboratory techniques involving a
wide variety of fault injection techniques, ranging from physical fault insertion to simulation, have been
developed and tested. The measured hardware and software data have been used not only to characterize the system
reliability and fault tolerance in the field, but also to jointly characterize the interdependence between
reliability and performance. Measurement-based research has revealed the dependence of failure rates on workload,
led to the development of improved diagnosis strategies, and has also contributed to the development of accurate
modeling and validation techniques. Finally, such measurements are crucial in evaluating the coverage of different
fault tolerance and recovery mechanisms in the system.

5.2.1.  Field  Measurements 

From a research point of view, field measurements have provided much valuable information on actual failure
characteristics and their distributions. They provide estimates for parameters used in analytical models. Some
examples are component failure rates, coverages and the relative frequency of different types of faults. Often, the
interactions among hardware, software, and application programs are complex and hence not easily amenable to
analysis. Measurements serve as an exploratory tool to understand the effect of faults on these different system
components and their interactions.
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Specifically,  research  based  on  field  experiments  has  resulted  in  several  significant  findings.  First,  results 
have  shown  that  the  commonly  used,  simple  exponential  model  is  representative  of  only  a  small  fraction  of  system 
failures.  Second,  the  failure  distributions  are  best  characterized  by  the  Weibull  function  [4],  Depending  on  the 
failure  type,  the  hazard  function  can  be  decreasing,  increasing  or  constant.  Finally,  both  hardware  and  software 
failures  have  a  tendency  to  occur  in  bursts  [55].  Even  though  the  cause  of  the  burst  is  often  a  single  fault,  its 
effect  impacts  several  components  leading  to  multiple  errors  or  failures.  Thus,  unless  error  detection  and  diagnosis 
techniques  substantially  improve,  the  single  point  failure  assumption  common  to  many  system  design  strategies 
may  not  be  fully  justified. 
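
The three hazard regimes mentioned above can be illustrated with the standard two-parameter Weibull hazard function h(t) = (k/s)(t/s)^(k-1), where k is the shape and s the scale parameter. The following is a minimal sketch; the function names and any parameter values used with them are illustrative assumptions and are not taken from the measurements reported in [4].

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def hazard_trend(shape):
    """Classify the hazard function by its shape parameter."""
    if shape < 1:
        return "decreasing"   # early-life ("infant mortality") behavior
    if shape > 1:
        return "increasing"   # wear-out behavior
    return "constant"         # reduces to the simple exponential model
```

With shape equal to 1 the hazard is the constant 1/scale, which is the exponential special case that the field data rarely support.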

Importantly, the above investigations also showed that the dependability of both hardware and software was
significantly affected by the operational environment of the system. Experimental investigations conducted to
quantify this phenomenon are discussed in the next section.


5.2.1.1.  Workload  Impact  on  Failure  Characteristics

Experimental research, based on over a decade of measurements on several generations of IBM, DEC and
other mainframes [56, 57], has established the influence of the level and type of operational workloads on system
reliability. Measured error and workload data from IBM and DEC systems under different operational environments
have shown that, on the average, the failure rate of a system was four to five times as high under heavy
workloads as under low workloads. On a dynamic level, the measurements showed that the risk of a failure at high
workloads was 50 to 100 times greater than that at low loads. These results are significant because, even though
some (e.g., process control) computers repetitively execute the same program with effectively the same input
requests, most have widely-varying workloads as measured by such metrics as processor utilization. Thus, the
results brought into question the validity of conventional reliability models, which do not take the operational
environment into account, and hence added a new dimension to dependability evaluation.


The dependency of reliability on workload is due to several phenomena. The first is referred to as error
latency. As failures occur within a system, they must be detected in order to affect the statistics. Many failures
lie dormant (or latent) until a particular module or subsystem is exercised. These latent faults are more likely to
manifest during high workload conditions since an increase in the workload implies an increase in the state
transitions and path executions in the computer. Thus, even if failures are not caused by increased utilization, they
are revealed by this factor. Secondly, as system utilization approaches saturation levels, a statistically higher
software failure rate results due to increased stress on these programs. Timing and synchronization problems are
also more likely to be revealed at high workloads and often these conditions are difficult to reproduce in the
laboratory. Also, many load-dependent failures occur in the area of code involved with exception handling. Usually,
this section of code is not well debugged. Under high workload conditions, as critical resources get saturated,
the exception handling code may be executed and reveal software faults and design errors. There is also some
evidence to show that higher workloads result in higher operating temperatures and hence in increased failure rates.

The results of these studies are significant. They indicate that it is not useful to push a system close to its
performance limits (the generally accepted operational goal). The slight gain in performance is more
than offset by the degradation in system reliability. Thus, classical computer reliability models need to be
re-evaluated in order to take system workload explicitly into account. This research has had a strong impact on the
modeling community. Several researchers [58, 59] have since proposed analytical models that take workload
variations into account. The second impact has been to bring out the importance of validation as an integral part
of the modeling process.

5.2.2.  Measurement-Based  Models 

Given the above results, it is reasonable to ask how workload parameters can be taken into account in generating
reliability/availability models. One approach is to model the workload as a daily 24 hour cycle and
assume a linear relationship between workload and failures. The ensuing model is cyclostationary in nature and
has been shown to represent real system behavior [57].
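
As an illustration of the cyclic, workload-linear approach, the sketch below assumes a hypothetical sinusoidal daily utilization profile and a failure rate of the form base + slope * workload. The profile, the coefficients, and the treatment of failures as a nonhomogeneous Poisson process are assumptions made for this example only; they are not the model or the parameters of [57].

```python
import math


def workload(hour):
    """Hypothetical daily utilization profile in [0.1, 0.9], peaking at 14:00."""
    return 0.5 + 0.4 * math.cos(2 * math.pi * (hour - 14) / 24)


def failure_rate(hour, base=0.001, slope=0.01):
    """Failures per hour, assumed linear in workload; periodic with a 24 hour cycle."""
    return base + slope * workload(hour % 24)


def expected_failures(start, end, steps=10_000):
    """Midpoint-rule integral of the failure rate over [start, end], in hours."""
    dt = (end - start) / steps
    return sum(failure_rate(start + (i + 0.5) * dt) for i in range(steps)) * dt


def reliability(start, end):
    """Probability of no failure in [start, end], assuming a Poisson failure process."""
    return math.exp(-expected_failures(start, end))
```

Under these assumptions an eight hour window around the daily peak, such as reliability(10, 18), is less reliable than a window of equal length around the nightly trough, such as reliability(22, 30), which is the behavior a workload-independent model cannot express.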

Experimental research has developed methods for identifying and building Markov models of the
resource-usage/failure/recovery process directly from measured data. The approach uses sampled system activity
parameters to identify clusters of usage, each of which can then be identified as a state in a performance/reliability
model [60]. At each interval of time the measured workload is represented by a point in four-dimensional space (CPU utilization,
CPU waiting for input/output, I/O controller activity, and disk activity). A statistical cluster analysis technique
was used to divide the workload into similar classes. Each cluster was represented as a system state, and a state
transition diagram with intercluster transition probabilities was developed.
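
The step from clustered workload samples to a state transition diagram can be sketched as follows. This example does not perform the statistical cluster analysis itself; as a simplification it assumes the cluster centroids are already known and assigns each sample to the nearest one. The centroid values and state names are hypothetical.

```python
from collections import Counter

# Hypothetical cluster centroids in the four-dimensional workload space:
# (CPU utilization, CPU waiting for I/O, I/O controller activity, disk activity)
CENTROIDS = {
    "idle":      (0.10, 0.05, 0.05, 0.05),
    "cpu_bound": (0.90, 0.05, 0.20, 0.10),
    "io_bound":  (0.30, 0.60, 0.80, 0.70),
}


def nearest_state(sample):
    """Assign a workload sample to the state with the closest centroid."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))
    return min(CENTROIDS, key=lambda name: dist2(CENTROIDS[name]))


def transition_probabilities(samples):
    """Estimate state-to-state transition probabilities from a time-ordered sample list."""
    states = [nearest_state(s) for s in samples]
    pair_counts = Counter(zip(states, states[1:]))   # observed (from, to) pairs
    out_counts = Counter(states[:-1])                # departures from each state
    return {(a, b): n / out_counts[a] for (a, b), n in pair_counts.items()}
```

Each estimate is simply the observed fraction of departures from one state that lead to another, so it is only meaningful when the trace contains enough visits to every state.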


5.2.2.1.  Software  Reliability  Evaluation

There has been a great deal of research in the area of software reliability evaluation and a large number of
models have been proposed. By and large, the term software reliability refers to the manufacture of software. The
models are usually empirical in nature and attempt to describe the reliability growth of the candidate software during
the manufacturing, debugging and testing phases. In general
the models can be divided into two classes. The first class is based on the number of remaining defects in the
software. The simplest such model, referred to as the Jelinski-Moranda model [61], assumes that the failure
rate is proportional to the number of remaining defects. Also, perfect repair of a software bug is assumed.
There are a number of generalizations of this approach. Imperfect debugging and uncertainty in the projected number
of initial defects have both been modeled [62]. The vast majority of these models have been shown to be valid in
their measured environments. The second class of models [63] does not depend on knowledge of the number of
remaining defects or their distribution. Thus, while most models assume that the failure rate is a function of the
number of remaining defects, the Littlewood-Verrall model assumes the failure rate is a random variable with a
gamma distribution. Thus the software reliability becomes a doubly stochastic process. The concept of the failure
rate as a random variable is expected to treat the uncertainty in the efficiency of the bug-fixing process. A comparison
of many of the existing models has been made by several researchers [62, 64] using different data sets.
Although most models perform well within their own contexts, their performance varies significantly from one
data set to another. Thus, no single model can be expected to perform well under all circumstances. In other
words, the question of deciding a priori what the best model is for a given situation remains open at this
stage. Additionally, few models address the question of operational reliability of software systems. Studies on the
impact of the operating environment on software reliability are given in [57, 65].
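
As a small worked illustration of the first class of models, the sketch below uses the usual textbook form of the Jelinski-Moranda model, in which the failure rate after i perfect repairs is phi * (N - i), with N the initial number of defects and phi a per-defect rate. The function names and any numeric values are assumptions for illustration and are not drawn from [61].

```python
import math


def jm_hazard(n_initial, phi, failures_fixed):
    """Failure rate after `failures_fixed` perfect repairs: phi * (N - fixed)."""
    remaining = n_initial - failures_fixed
    if remaining < 0:
        raise ValueError("cannot fix more defects than were initially present")
    return phi * remaining


def jm_mean_time_to_next_failure(n_initial, phi, failures_fixed):
    """Mean of the exponentially distributed time to the next failure."""
    rate = jm_hazard(n_initial, phi, failures_fixed)
    return math.inf if rate == 0 else 1.0 / rate


def jm_reliability(n_initial, phi, failures_fixed, t):
    """Probability of surviving a further time t without failure."""
    return math.exp(-jm_hazard(n_initial, phi, failures_fixed) * t)
```

The mean time to the next failure grows with every repair, which is the reliability growth these models attempt to describe; the sketch inherits the perfect-repair assumption and therefore none of the generalizations in [62].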

5.2.3.  Controlled  Experiments:  Fault  Injection


Although field data provide a rich source of information, an adequate number of machine years of data is
not always available. Fault injection is an important method to mimic the occurrence of errors in a controlled
environment that can be instrumented to make the necessary measurements [66, 67]. Several automated tools to
inject both physical and simulated faults have been developed both in academia and in industry. Some of the
measurements of interest are latency [67] and coverage [68, 69].

There are numerous theoretical and practical difficulties associated with making measurements. The question
of what to measure, and how to measure it, is indeed a difficult one. From a statistical point of view, sound
evaluations require a considerable body of data. The usual assumptions regarding uniform populations and
stationarity may not fully hold in computing environments. Fault-injection experiments have known input error
distributions, but the question remains as to how representative of naturally-occurring errors are those that are selected
for injection. The success of such experiments depends on the choice of fault models, a realistic workload, and
finally, valid experimental design.
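
The sensitivity of measured coverage to the chosen fault model can be shown with a toy simulated fault-injection experiment. The sketch below protects a data word with a single even-parity bit, exhaustively injects bit-flip faults of a given multiplicity, and reports the fraction detected. The parity-protected word and the bit-flip fault model are assumptions chosen for simplicity; they do not represent any of the tools cited above.

```python
from itertools import combinations


def parity(bits):
    """Even-parity bit for a sequence of 0/1 values."""
    return sum(bits) % 2


def encode(data_bits):
    """Append a parity bit to the data word."""
    return list(data_bits) + [parity(data_bits)]


def detected(word):
    """The checker flags an error when the stored parity bit does not match."""
    return parity(word[:-1]) != word[-1]


def inject(word, positions):
    """Return a copy of the word with the bits at `positions` flipped."""
    faulty = list(word)
    for p in positions:
        faulty[p] ^= 1
    return faulty


def coverage(data_bits, flips_per_fault):
    """Fraction of all faults with the given number of bit flips that are detected."""
    word = encode(data_bits)
    faults = list(combinations(range(len(word)), flips_per_fault))
    hits = sum(detected(inject(word, f)) for f in faults)
    return hits / len(faults)
```

Under a single-bit fault model the measured coverage is 1.0, while under a double-bit model it drops to 0.0, so the same detection mechanism yields opposite conclusions depending on which faults are selected for injection.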


6.  Commercial  Fault-Tolerant  Computing 


Fault-tolerant computing has evolved from specialized military and communications systems to general-purpose,
high-availability commercial systems. The evolution of fault-tolerant computers has been well documented
[4, 76]. The earliest high availability systems were developed in the 1950’s by IBM, Univac, and Remington
Rand for military applications. In the 1960’s, NASA, IBM, SRI, the C. S. Draper Laboratory and the Jet
Propulsion Laboratory began to apply fault tolerance to the development of guidance computers for aerospace
applications. The 1960’s also saw the development of the first AT&T electronic switching systems.


The first commercial fault-tolerant machines were introduced by Tandem Computers in the 1970’s for use in
on-line transaction processing applications [71]. Several other commercial fault-tolerant systems were introduced
in the 1980’s [72]. Current commercial fault-tolerant systems include distributed memory multi-processors (Tandem
NonStop [73], Tolerant Eternity [74]), shared-memory transaction-based systems (Sequoia [75]), "pair-and-spare"
hardware fault-tolerant systems (Stratus [76], DEC VAXft 3000 [75]), and triple-modular-redundant
systems (Tandem Integrity S2).


Most applications of commercial fault-tolerant computers fall into the category of on-line transaction processing.
Financial institutions require high availability for electronic funds transfer, control of automatic teller
machines, and stock market trading systems. Manufacturers use fault tolerant machines for automated factory control,
inventory management, and on-line document access systems. Other applications of fault tolerant machines
include reservation systems, government databases, wagering systems, and telecommunications systems.

Vendors of fault tolerant machines attempt to achieve both increased system availability and continuous processing.
Depending on the system architecture, either processes continue to run despite failures, or the processes
are automatically restarted from a recent checkpoint. Some traditional systems have enough redundancy to
reconfigure around failed components, but processes running in the failed modules are lost. Vendors of commercial
fault-tolerant systems have extended fault tolerance beyond the processors and disks. To make large improvements
in reliability, all sources of failure must be addressed, including power supplies, fans and inter-module
connections.

The Tandem NonStop and Integrity architectures will be described to illustrate two current approaches to
commercial fault-tolerant computing.

6.1.  Tandem  NonStop  Systems 

Tandem NonStop systems are designed to continue operation despite the failure of any single hardware
component. In normal operation, each system uses its major components independently and concurrently, rather
than as "hot standbys." Figure 11 shows the architecture of the NonStop Cyclone system. A system consists of
up to 16 processors interconnected by dual busses. Each processor has its own memory which contains a copy of
the message-based Guardian operating system. Each processor controls one or more I/O busses. Dual-porting of
I/O controllers and devices provides multiple paths to each device. Disks may be mirrored to maintain redundant
permanent data storage.

NonStop, Guardian, Integrity S2, NonStop Cyclone and NonStop V+ are trademarks of Tandem Computers Incorporated.

Each module has self-checking hardware to provide "fail-fast" operation: either a module operates
correctly, or it stops to prevent contamination of other modules. Faults are detected by parity checking, duplication
and comparison, and error detection codes. Fault detection is primarily the responsibility of the hardware,
while fault recovery is the responsibility of the software.

Processes under Guardian may run as process-pairs. A primary process runs in one processor and a backup
process runs in a different processor. The backup is usually dormant, but periodically updates its state in response
to checkpoint messages from the primary. A checkpoint can take the form of a complete state update, or of a
delta checkpoint which communicates only the changes from the previous checkpoint. Originally, checkpoints
were manually inserted in application programs, but currently most application code runs under transaction processing
software which provides recovery through a combination of checkpoints and transaction two-phase commit
protocols.
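
The difference between a complete state update and a delta checkpoint can be sketched with a toy process pair. The class and method names below are hypothetical and the state is a plain dictionary; this is not the Guardian checkpointing interface, and for brevity a delta carries only added or changed keys, not deletions.

```python
_MISSING = object()  # sentinel for keys that have never been checkpointed


class Backup:
    """Dormant backup process: it only applies checkpoint messages."""

    def __init__(self):
        self.state = {}

    def receive(self, message):
        kind, payload = message
        if kind == "full":
            self.state = dict(payload)      # replace the whole state
        elif kind == "delta":
            self.state.update(payload)      # apply only the changes
        else:
            raise ValueError(f"unknown checkpoint kind: {kind}")


class Primary:
    """Primary process that sends its state to the backup at checkpoints."""

    def __init__(self, backup):
        self.state = {}
        self.backup = backup
        self._last_sent = {}

    def update(self, key, value):
        self.state[key] = value

    def checkpoint(self, full=False):
        if full:
            message = ("full", dict(self.state))
        else:
            changes = {k: v for k, v in self.state.items()
                       if self._last_sent.get(k, _MISSING) != v}
            message = ("delta", changes)
        self.backup.receive(message)
        self._last_sent = dict(self.state)
        return message
```

Any update made after the most recent checkpoint is absent from the backup, which is why a takeover resumes from the last checkpoint rather than from the exact point of failure.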

When a processor fails, the failing processor is identified by the absence of periodic "I'm Alive" messages.
Guardian directs the appropriate backup processes to begin primary execution from the last checkpoint. New
backup processes may be started in another processor, or the process may be run with no backup until the
hardware has been repaired.
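
Failure detection by missing "I'm Alive" messages, followed by backup takeover, can be sketched as below. The timeout value, the data structures, and the function names are assumptions made for this example; the actual Guardian protocol is not described in enough detail here to reproduce.

```python
def failed_processors(last_heard, now, timeout):
    """Processors whose most recent "I'm Alive" message is older than `timeout`."""
    return sorted(cpu for cpu, t in last_heard.items() if now - t > timeout)


def take_over(process_pairs, failed):
    """Reassign process pairs after processor failures.

    `process_pairs` maps a process name to (primary_cpu, backup_cpu).
    A surviving member becomes the primary and runs with no backup (None)
    until a new backup is started; a pair with no survivor is lost.
    """
    result = {}
    for name, (primary, backup) in process_pairs.items():
        survivors = [cpu for cpu in (primary, backup)
                     if cpu is not None and cpu not in failed]
        if not survivors:
            result[name] = (None, None)
        elif len(survivors) == 1:
            result[name] = (survivors[0], None)
        else:
            result[name] = (primary, backup)
    return result
```

A pair whose backup processor fails keeps its primary and simply loses its backup, which corresponds to running unprotected until the hardware has been repaired.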

Each I/O controller is managed by one of the two processors to which it is attached. Management of the
controller is periodically switched between the processors. If the managing processor fails, ownership of the
controller is automatically switched to the other processor. If the controller fails, access to the data is maintained
through another controller.

In addition to providing hardware fault tolerance, process pairs provide some measure of software fault
tolerance. When a processor fails due to a software bug, the backup processes frequently are able to continue processing
without encountering the same bug. The software environment in the backup processor typically has
different queue lengths, table sizes, and process mixes. Since most of the software bugs escaping the software
quality assurance tests involve infrequent data dependent boundary conditions, the backup processes often succeed.

Continuous operation requires the capability for faulty modules to be identified, repaired, and reintegrated
while the system is on-line. A fault-tolerant diagnostic system monitors system operation, isolates the most likely
failing module, and optionally dials a remote center to request service. Modules such as processor boards,
controllers, disks, fans, and power supplies may be replaced on-line.


6.2.  Integrity  S2

The  Integrity  S2  illustrates  another  approach  to  fault-tolerant  computing.  S2,  which  was  introduced  in  1990, 
was  designed  to  run  a  standard  version  of  the  UNIX  operating  system.  In  systems  where  compatibility  is  a  major 
goal,  hardware  fault  recovery  is  the  logical  choice  since  few  modifications  to  the  software  are  required. 

A  diagram  of  the  Integrity  S2  system  is  shown  in  Figure  12.  The  processors  and  local  memories  are 
configured  using  triple-modular-redundancy  (TMR).  All  processors  run  the  same  code  stream,  but  clocking  of 
each module is independent to tolerate faults in the clocking circuits. Execution of the three streams is asynchronous,
and may drift several clock periods apart. The streams are re-synchronized periodically and during access of
global  memory.  Voters  on  the  TMR  Controller  boards  detect  and  mask  failures  in  a  processor  module. 
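
The masking behavior of a TMR voter can be sketched as a simple majority function. This is a software illustration only; the function name and return convention are assumptions, and the real voters on the TMR Controller boards are hardware that must also cope with the asynchronous instruction streams described above.

```python
from collections import Counter


def tmr_vote(outputs):
    """Majority-vote three module outputs.

    Returns (voted_value, faulty_indices). A single disagreeing module is
    masked and reported; if all three disagree there is no majority.
    """
    if len(outputs) != 3:
        raise ValueError("TMR voting requires exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one module disagrees")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

For example, tmr_vote([42, 7, 42]) yields the value 42 and identifies module 1 as the disagreeing module, so the failure is masked while the faulty module is flagged for repair.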

Memory is partitioned between the local memory on the triplicated processor boards and the global memory
on the duplicated TMRC boards. The duplicated portions of the system use self-checking techniques to detect
failures. Each global memory is dual ported and is interfaced to the processors as well as to the I/O Processors
(IOPs). Each IOP controls a NonStop V+ bus. Standard VME peripheral controllers are interfaced to a pair of
NonStop V+ busses through a Bus Interface Module (BIM). If an IOP fails, the BIM switches control of all controllers
to the remaining IOP. Mirrored disks may be attached to two different VME controllers.

In Integrity S2, all hardware failures are masked by the redundant hardware. After repair, components are
reintegrated on-line.


The preceding examples have shown ways in which commercial vendors have incorporated fault tolerance
into data processing systems. Approaches involving software recovery require less redundant hardware, and offer
the potential for some software fault tolerance. Hardware approaches use extra hardware redundancy to allow full
compatibility with standard operating systems and to transparently run applications which have been developed on
other systems. Commercial fault-tolerant computing will grow in importance as companies grow increasingly
dependent on the correct operation of their computer systems.